APPLICATION  OF  NEURAL  NETWORKS  AND  GENETIC  ALGORITHMS 

IN  HIGH  ENERGY  PHYSICS 


BY 

YOULI  ANDREEV  KANEV 


A DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 
OF  THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT 
OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 

DOCTOR  OF  PHILOSOPHY 


UNIVERSITY  OF  FLORIDA 


1998 


ACKNOWLEDGMENTS 


I would  like  to  thank  my  friend  and  former  fellow  student  Dr.  Mohammad 
Reza  Tayebnejad  for  the  numerous  fruitful  discussions  we  have  had  about  issues 
pertaining  to  this  work. 

I am  very  grateful  to  Dr.  Michael  Jones  for  his  reading  the  entire  manuscript 
and  making  valuable  suggestions.  As  a result,  the  style  and  clarity  of  the  exposition 
have  improved  significantly. 

I am also indebted to John Bro for discussing with me some fine grammatical points and making numerous suggestions that improved both the style and the grammar of the text.

Finally, and most importantly, I extend my sincerest gratitude to my academic advisor, Prof. Richard Field. His ability to create an environment conducive to independent thinking is difficult to match. At the same time, his close involvement in the research projects played a major role in the extent and quality of this dissertation. Only a rare person can achieve such a difficult balance between guidance and independence.




TABLE  OF  CONTENTS 


ACKNOWLEDGMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION

2 THE STANDARD MODEL
    2.1 Quantum Electrodynamics
    2.2 Electroweak Theory
        2.2.1 The Free-Field Lagrangian
        2.2.2 Introduction of the Massless Gauge Fields
        2.2.3 Spontaneous Breaking of the Symmetry via the Higgs Mechanism
        2.2.4 Coupling of $\phi$ to the Leptons
        2.2.5 Fixing the Gauge and Obtaining the Electroweak Lagrangian for Leptons
        2.2.6 Extension to Quarks
    2.3 Quantum Chromodynamics

3 DATA ANALYSIS IN HIGH ENERGY PHYSICS
    3.1 Data Representation and Transformation
        3.1.1 Events Characterization
        3.1.2 Jets
        3.1.3 Event Topology
    3.2 Methods For Signal Enhancement
        3.2.1 Linear Cuts
        3.2.2 Fisher Discriminants
        3.2.3 Neural Networks

4 NEURAL NETWORKS
    4.1 Neural Networks and Machine Intelligence
        4.1.1 Neural Networks as a Model of a Nervous System
        4.1.2 Neural Networks as Model-Free Function Estimators
        4.1.3 Neural Networks as Dynamical Systems
        4.1.4 Neural Networks as Ostensive Learning Systems
    4.2 Structure of Neural Networks
        4.2.1 Network Topology
        4.2.2 Neuronal Activation
        4.2.3 Performance Measures
    4.3 Training Algorithms
        4.3.1 Supervised and Unsupervised Learning
        4.3.2 The Gradient Descent Algorithm
        4.3.3 The Backpropagation Algorithm
        4.3.4 Global Algorithms

5 GENETIC ALGORITHMS
    5.1 Basic Concepts in Genetics
        5.1.1 DNA structure
        5.1.2 Gene Expression
        5.1.3 Replication of DNA
    5.2 Basic Concepts in Biological Microevolution
        5.2.1 Sources of Genetic Variation
        5.2.2 Natural Selection
        5.2.3 Social Behavior
    5.3 Maximization Problems as Biological Populations
        5.3.1 Modeling of Genetic Information
        5.3.2 Modeling of Genetic Variation
        5.3.3 Modeling of Microevolution
    5.4 Convergence Factors in Genetic Algorithms
        5.4.1 Population Size
        5.4.2 Natural Selection (Survival of the Fittest)
        5.4.3 Reproduction
        5.4.4 Gene Linkage
        5.4.5 Age
        5.4.6 Directed Genetic Changes (Embedded Algorithms)
    5.5 Application to Finding Optimal Linear Cuts
    5.6 Application to Training of Neural Networks

6 ENHANCING THE HIGGS BOSON SIGNAL
    6.1 Data Generation and Cuts Without Pile-Up
        6.1.1 Event Generation
        6.1.2 Lepton Trigger
        6.1.3 Jet-Pair Selection
        6.1.4 Invariant Mass Cuts
    6.2 Network Analysis Without Pile-Up
        6.2.1 Network Inputs
        6.2.2 Network Structure and Training
        6.2.3 Network Performance
        6.2.4 Network Cut-Off
        6.2.5 Network Weighting
    6.3 Fisher Discriminants Analysis Without Pile-Up
    6.4 Data Generation and Cuts With Pile-Up
    6.5 Network Analysis With Pile-Up
    6.6 Summary

7 THE FOUR-JET DECAY MODE OF TOP
    7.1 Data Generation and Cuts
        7.1.1 Event Generation
        7.1.2 Lepton Plus Missing Transverse Energy Trigger
        7.1.3 Calorimeter Cell Cuts
    7.2 Reconstructing the Top-Pair Invariant Mass
        7.2.1 Reconstructing the Neutrino Momentum
        7.2.2 Adding in the Momentum and Energy of the Jets
        7.2.3 Comparing With the Parton-Parton Center of Mass Energy
    7.3 Modified Fox-Wolfram Moments
    7.4 Neural Network Analysis
    7.5 Fisher Discriminants Analysis
    7.6 Summary

8 THE SIX-JET DECAY MODE OF TOP
    8.1 Event Simulation and Detection
    8.2 Modified Fox-Wolfram Moments
        8.2.1 Constructing Fox-Wolfram Moments from Calorimeter Cells
        8.2.2 Constructing Modified Fox-Wolfram Moments from Jets
    8.3 Multi-Dimensional Linear Cuts and Genetic Algorithms
    8.4 Reconstructing the Top-Pair Invariant Mass
        8.4.1 Using the Calorimeter Cells Directly
        8.4.2 Using the Outgoing Jets
    8.5 Isolating Multi-Jet Topologies
        8.5.1 Using Jet Multiplicity Cuts
        8.5.2 Using $H_l$ Cuts Without Jets
        8.5.3 Using $H_l$ Cuts With Jets
    8.6 Summary

9 TAGGING OF TAU JETS
    9.1 Event Generation and Representation of Jets
    9.2 Tagging with Multi-Dimensional Linear Cuts
    9.3 Tagging with Neural Networks

10 CONCLUSION

APPENDICES

A NOTATIONS
    A.1 Gamma Matrices
    A.2 The Levi-Civita Tensor

B DIRAC SPINORS

REFERENCES

BIOGRAPHICAL SKETCH


Abstract  of  Dissertation  Presented  to  the  Graduate  School 
of  the  University  of  Florida  in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of  Doctor  of  Philosophy 

APPLICATION  OF  NEURAL  NETWORKS  AND  GENETIC  ALGORITHMS 

IN  HIGH  ENERGY  PHYSICS 

By 

Youli  Andreev  Kanev 
August  1998 

Chairman:  R.  Field 
Major  Department:  Physics 

We  develop  and  apply  new  methods  for  signal  enhancement  in  high  energy 
physics.  We  use  neural  networks  as  an  ostensive  system  capable  of  recognizing 
patterns  in  input  collider  data,  even  when  the  patterns  are  not  well  defined.  We 
show that neural networks give better signal enhancement than traditional methods.

In addition, we develop a genetic algorithm that is self-adaptive to the parameter space and the stage of evolution. It can also embed other algorithms and, therefore, it is an extension of existing maximization algorithms.

We  show  that  certain  problems  in  high  energy  physics  cannot  be  solved  by  using 
local  maximization  algorithms  and  that  the  genetic  algorithm  presented  here  can 
be  applied  to  these  cases  as  well.  We  illustrate  one  such  case  by  using  neural 
networks  trained  by  our  genetic  algorithm. 

We  give  four  specific  applications — Higgs  decay,  four-jet  top  decay,  six-jet  top 
decay  and  tau  jets  tagging — which  show  the  validity  of  these  techniques. 




CHAPTER  1 
INTRODUCTION 


The  central  topic  of  this  dissertation  is  the  development  and  application  of 
new  methods  for  signal  enhancement  in  high  energy  physics.  We  will  apply  neural 
networks  to  high  energy  data  to  enhance  the  signal  over  the  background  and  show 
that  neural  networks  are  superior  tools  compared  to  traditional  methods.  We 
will  also  develop  a version  of  genetic  algorithms  that  will  allow  us  to  find  optimal 
solutions  of  complex  maximization  problems.  We  will  show  that  genetic  algorithms 
are  superior  maximization  algorithms  when  the  dimension  of  the  parameter  space 
is  very  high.  In  addition,  we  will  show  that  the  domain  of  applicability  of  genetic 
algorithms  is  larger  than  that  of  traditional  local  algorithms. 

Neural networks are a computer model of the nervous system of a biological individual. Their most important feature is their ostensive nature: they can recognize complex patterns without any definition of those patterns. They can also be viewed as model-free discriminant functions. Unlike other model-free discriminant functions, however, neural networks do not fit the input data; they discover and generalize the undefined patterns embedded in it.

Neural  networks  are  also  dynamical  systems  during  the  process  of  training. 
Thus  the  study  of  what  neural  networks  can  learn  is  the  study  of  the  asymptotic 
behavior  of  a training  algorithm  which  is  defined  on  its  attractor. 

Genetic algorithms are modeled after genetics and microevolution. The basic idea is to represent a maximization problem as a new species and evolve an entire population of individuals. The “fitness” of an individual reflects the desired performance measure for the maximization problem. As the population evolves, the fitness of its members constantly improves. Each individual in the population represents a specific solution. The “fittest” individual represents the best solution discovered by the algorithm.

The  most  important  feature  of  genetic  algorithms  is  their  ability  to  be  both 
local  and  global  in  nature.  Local  algorithms  are  often  inadequate  for  maximization 
over  high-dimensional  parameter  spaces.  On  the  other  hand,  global  algorithms  are 
often  too  slow  in  those  cases  to  be  practical.  Genetic  algorithms  combine  the  best 
features  of  local  and  global  algorithms — they  can  be  used  on  high-dimensional 
spaces  and  be  sufficiently  fast  at  the  same  time. 
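
As a rough illustration of the evolutionary loop described above, the following Python sketch evolves a population of candidate solutions toward the maximum of a toy fitness function. It is a generic textbook-style genetic algorithm and not the self-adaptive algorithm developed in Chapter 5. The fitness function, the parameter values and the names `fitness` and `evolve` are assumptions chosen only for this example.

```python
import random


def fitness(individual):
    # Hypothetical performance measure: largest (zero) when every gene equals 0.5.
    return -sum((gene - 0.5) ** 2 for gene in individual)


def evolve(dim=5, pop_size=40, generations=200, mutation_scale=0.1, seed=0):
    rng = random.Random(seed)
    # Each individual is one candidate solution, i.e. a point in parameter space.
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # Reproduction: a child takes each gene from one of two parents,
        # then every gene receives a small random mutation.
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(mother, father)]
            child = [gene + rng.gauss(0.0, mutation_scale) for gene in child]
            children.append(child)
        population = survivors + children
    # The fittest individual is the best solution found.
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print("best individual:", best)
    print("fitness:", fitness(best))
```

Because the survivors are carried over unchanged, the best fitness in this sketch never decreases from one generation to the next, while mutation and recombination keep exploring the parameter space.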

To  demonstrate  the  validity  of  these  methods,  we  will  apply  them  to  four 
different  problems  in  high  energy  physics.  We  will  apply  them  to  the  decay  of  the 
yet  undiscovered  Higgs  particle  in  order  to  enhance  the  Higgs  signal  compared  to 
all  other  background  processes.  We  will  also  apply  our  methods  to  the  four-jet  and 
six-jet  decay  mode  of  top.  Finally,  we  will  use  neural  networks  trained  by  genetic 
algorithms  to  distinguish  between  tau  jets  and  “ordinary”  quark  jets. 

In  Chapter  2,  we  will  give  an  overview  of  the  Standard  Model  of  fundamental 
interactions. We will concentrate on the derivation of the Lagrangian of the Standard Model based on gauge principles and derive its final form that unifies three
of  the  four  known  forces  in  nature. 

In Chapter 3, we will present the basic concepts relevant to the analysis of high energy physics collider events. We will review the most common existing methods and present a new way of data representation: the modified Fox-Wolfram moments.

In  Chapter  4,  we  will  present  neural  networks  in  detail.  We  will  discuss  how 
they  relate  to  neuroscience  and  machine  intelligence.  In  addition,  we  will  present 
the  major  types  of  neural  networks  and  the  training  algorithms  that  can  be  applied 
to certain classes of neural networks. We will give a critical analysis of the current state of neural network theory and expose some of the deficiencies associated with both network structure and the training algorithms that are commonly used. We will also show that the most commonly used neural networks and training algorithms cannot, even in principle, be used in certain high energy physics applications and that the usage of genetic algorithms can remedy this deficiency.

In Chapter 5, we present in detail the principles behind genetic algorithms as well as the construction of the genetic algorithms used in this work. We begin this chapter with an overview of genetics and microevolution. We then discuss how to represent maximization problems as populations of individuals as well as the major factors that affect the convergence properties of genetic algorithms. The genetic algorithms presented in this work do not follow the traditional approaches, which are too simplistic and, generally, are too slow and do not give good results. For example, our version of genetic algorithms uses sex explicitly and the genome of individuals is extended to contain genes that are dynamical variables of the population evolution. The latter feature obviates the need for fine-tuning of parameters and makes the algorithm self-adaptive to both the problem being solved and the stage of evolution. We conclude this chapter with two applications of genetic algorithms: finding optimal multi-dimensional linear cuts and training neural networks.

In  Chapter  6,  we  apply  neural  networks  in  conjunction  with  traditional  selection 
cuts  for  the  purpose  of  enhancing  the  Higgs  signal.  We  show  that  neural  networks 
give  a better  enhancement  than  traditional  methods. 

In Chapter 7, we apply neural networks to the four-jet decay mode of the top quark. We use the modified Fox-Wolfram moments and show that they are an excellent representation of collider events since neural networks give similar performance compared to the traditional Fisher discriminants (which are a very particular case of a neural network).

In Chapter 8, we use multi-dimensional linear cuts on the modified Fox-Wolfram moments for the purpose of enhancing the six-jet decay mode of the top quark. We use genetic algorithms to discover the optimal cuts on the modified Fox-Wolfram moments and reduce the background-to-signal ratio to less than 100, without the use of b-tagging and on purely topological grounds.

Finally,  in  Chapter  9 we  use  multi-dimensional  linear  cuts  and  neural  networks 
both  trained  by  genetic  algorithms  for  the  purpose  of  distinguishing  tau  jets  from 
quark  jets.  The  results  in  this  chapter  are  preliminary,  but  we  are  still  able  to 
show  that  genetic  algorithms  can  be  successfully  used  to  train  neural  networks. 
In  addition,  we  show  once  again  that  neural  networks  are  superior  to  other  tools 
since  they  give  a much  better  performance  compared  to  linear  cuts,  in  this  case. 


CHAPTER  2 

THE  STANDARD  MODEL 


There are four known forces in nature: the gravitational force, the electromagnetic force, the weak nuclear force and the strong nuclear force. The Standard Model is a quantum field theory that unifies and explains the electromagnetic, weak nuclear and strong nuclear forces [1, 2, 3]. So far, there are no experimental results that are inconsistent with the Standard Model.

In its minimal form, the Standard Model contains the elementary particles shown in Fig. 2.1. There are exactly 3 generations of quarks and leptons organized in doublets. The gauge particles are carriers of the three interactions. The photons ($\gamma$) are responsible for the electromagnetic interaction. The gluons ($g$) are responsible for the strong nuclear interaction. The other gauge particles are responsible for the weak nuclear interaction. Except for the Higgs boson ($H$) and the $\tau$ neutrino ($\nu_\tau$), all other particles have been observed experimentally.

[Figure 2.1: diagram listing the quarks, the leptons and the gauge particles $\gamma$, $W^{\pm}$, $Z$, ($H$), $g$.]

Figure 2.1: All elementary particles of the Standard Model. There are three generations of quarks and leptons organized in doublets. The gauge particles carry the strong, weak and electromagnetic interactions.




Since the Standard Model is a relativistic theory, its Lagrangian, $\mathcal{L}_{SM}$, is invariant under the Lorentz group of transformations. However, it is also invariant under the $SU(3) \times SU(2) \times U(1)$ group of transformations. This intrinsic symmetry gives rise to the gauge particles corresponding to the three different interaction forces. The $U(1)$ symmetry gives rise to the photon and corresponds to electromagnetic interactions. The $SU(2) \times U(1)$ symmetry gives rise to the massive vector bosons $W^+$, $W^-$ and $Z$, the photon $\gamma$ as well as the Higgs boson, $H$. This subgroup describes the electromagnetic and the weak nuclear force. It is referred to as the electroweak sector of the Standard Model. The $SU(3)$ symmetry corresponds to the strong nuclear force and gives rise to gluons which carry the strong interaction.

In  this  chapter,  we  will  give  a general  overview  of  the  Standard  Model.  We  will 
only  concentrate  on  the  overall  structure  of  the  Lagrangian.  In  particular,  we  will 
show  that  intrinsic  local  symmetries  lead  to  the  introduction  of  new  (gauge)  fields. 
We  will  also  discuss  the  consequences  of  symmetries  that  are  not  respected  by  the 
ground  state.  These  symmetries  are  known  as  spontaneously  broken  symmetries. 
In addition, we will illustrate the so-called Higgs mechanism which allows gauge
particles  to  be  massive. 


2.1  Quantum  Electrodynamics 

Quantum  electrodynamics  (QED)  is  a quantum  field  theory  that  describes  the 
electromagnetic interaction between charged particles. The Standard Model includes QED as a particular case.

In quantum field theory, a particle is represented by a field which is, in general, complex. Let $\psi(x)$ be the field representing a charged particle where $x = (t, \mathbf{x})$ is the four-position in Minkowski space. If the charged particle is an electron, for example, the relevant free-field Lagrangian is given by

$$\mathcal{L}_0 = \bar\psi\,(i\gamma^\mu\partial_\mu - m)\,\psi. \tag{2.1}$$

We now consider the global phase transformation

$$\psi(x) \to e^{-i\alpha}\,\psi(x). \tag{2.2}$$

Since the adjoint Dirac spinor, $\bar\psi(x)$, transforms as

$$\bar\psi(x) \to e^{i\alpha}\,\bar\psi(x), \tag{2.3}$$

the free-field Lagrangian $\mathcal{L}_0$ is invariant under global phase transformations. Since phase transformations form the $U(1)$ group, we say that $\mathcal{L}_0$ has a global $U(1)$ symmetry. The physical interpretation of this symmetry is charge conservation.
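
The invariance can be checked term by term. The short derivation below is a sketch that assumes only the transformations (2.2) and (2.3) and the fact that $\alpha$ is a constant, so that the phase factor commutes with $\partial_\mu$:

```latex
\begin{align*}
m\,\bar\psi\psi
  &\to m\,(e^{i\alpha}\bar\psi)(e^{-i\alpha}\psi) = m\,\bar\psi\psi, \\
\bar\psi\, i\gamma^\mu\partial_\mu \psi
  &\to e^{i\alpha}\bar\psi\; i\gamma^\mu \partial_\mu\!\left(e^{-i\alpha}\psi\right)
   = e^{i\alpha}e^{-i\alpha}\,\bar\psi\, i\gamma^\mu\partial_\mu\psi
   = \bar\psi\, i\gamma^\mu\partial_\mu\psi
   \qquad (\partial_\mu\alpha = 0).
\end{align*}
```

Both terms of the free-field Lagrangian are therefore unchanged, which is the statement of global $U(1)$ symmetry.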

Let us now consider the local phase transformations

$$\psi(x) \to e^{-i\alpha(x)}\,\psi(x) \tag{2.4}$$

and

$$\bar\psi(x) \to e^{i\alpha(x)}\,\bar\psi(x) \tag{2.5}$$

obtained by replacing $\alpha$ with $\alpha(x)$, i.e. the phase now depends on the four-position $x$. The mass term in the Lagrangian is still invariant. The first term, however, is not since

$$\partial_\mu\psi(x) \to e^{-i\alpha(x)}\left[\partial_\mu\psi(x) - i\,(\partial_\mu\alpha(x))\,\psi(x)\right]. \tag{2.6}$$

The term containing the derivative of $\alpha(x)$ makes the original Lagrangian non-invariant under local phase transformations.

However, there is a way to change the original Lagrangian in such a way that local phase transformations leave the new Lagrangian invariant. Let us introduce a compensating massless vector field $A_\mu(x)$ as follows

$$\mathcal{L}_1 = \bar\psi\left(i\gamma^\mu(\partial_\mu + ieA_\mu) - m\right)\psi. \tag{2.7}$$

If the field $A_\mu(x)$ transforms under local phase transformations as

$$A_\mu(x) \to A'_\mu(x) = A_\mu(x) + \left(\frac{1}{e}\right)\partial_\mu\alpha(x) \tag{2.8}$$

then the second term in the equation above will exactly cancel the extra term coming from the derivative of the field $\psi(x)$. In other words, $\mathcal{L}_1$ is invariant under local phase transformations; i.e., it has $U(1)$ local (or gauge) symmetry. It is convenient to define the covariant derivative, $D_\mu$, as

$$D_\mu = \partial_\mu + ieA_\mu \tag{2.9}$$

which simplifies the form of $\mathcal{L}_1$. It can be written in terms of the covariant derivative as

$$\mathcal{L}_1 = \bar\psi\,(i\gamma^\mu D_\mu - m)\,\psi. \tag{2.10}$$

One can also rewrite the transformation of the field derivative as

$$D_\mu\psi(x) \to e^{-i\alpha(x)}\,D_\mu\psi(x) \tag{2.11}$$

which explicitly shows the local $U(1)$ gauge invariance of $\mathcal{L}_1$.
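
The covariance of the derivative can also be checked numerically. The Python sketch below is a one-dimensional toy analogue, not part of the original derivation: the functions `psi`, `gauge_field` and `alpha`, the value of `E_CHARGE` and the finite-difference step are all arbitrary assumptions made for illustration. It compares how the ordinary and the covariant derivative of the field behave under a local phase transformation.

```python
import cmath
import math

E_CHARGE = 0.3  # hypothetical value of the coupling constant e


def psi(x):
    """Toy matter field: an arbitrary smooth complex function."""
    return (1.0 + 0.5 * math.sin(x)) * cmath.exp(1.7j * x)


def gauge_field(x):
    """Toy gauge field A(x): an arbitrary smooth real function."""
    return 0.8 * math.cos(2.0 * x)


def alpha(x):
    """Position-dependent phase alpha(x)."""
    return 0.4 * math.sin(3.0 * x) + 0.1 * x * x


def derivative(f, x, h=1e-5):
    """Central finite-difference approximation to df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def covariant_derivative(field, a_field, x):
    """One-dimensional analogue of D = d/dx + i e A acting on `field`."""
    return derivative(field, x) + 1j * E_CHARGE * a_field(x) * field(x)


def psi_transformed(x):
    """Locally phase-rotated field, psi -> exp(-i alpha(x)) psi."""
    return cmath.exp(-1j * alpha(x)) * psi(x)


def gauge_field_transformed(x):
    """Compensating shift of the gauge field, A -> A + (1/e) d(alpha)/dx."""
    return gauge_field(x) + derivative(alpha, x) / E_CHARGE


def ordinary_mismatch(x):
    """|d(psi')/dx - exp(-i alpha) d(psi)/dx|: nonzero, the extra term survives."""
    lhs = derivative(psi_transformed, x)
    rhs = cmath.exp(-1j * alpha(x)) * derivative(psi, x)
    return abs(lhs - rhs)


def covariant_mismatch(x):
    """|D'psi' - exp(-i alpha) D psi|: zero up to finite-difference error."""
    lhs = covariant_derivative(psi_transformed, gauge_field_transformed, x)
    rhs = cmath.exp(-1j * alpha(x)) * covariant_derivative(psi, gauge_field, x)
    return abs(lhs - rhs)


if __name__ == "__main__":
    x0 = 0.7
    print("ordinary derivative mismatch: ", ordinary_mismatch(x0))
    print("covariant derivative mismatch:", covariant_mismatch(x0))
```

The ordinary derivative picks up the extra term proportional to the derivative of the phase, whereas the covariant derivative only acquires the overall phase factor, up to the numerical error of the finite differences.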

The newly introduced compensating (or gauge) field $A_\mu(x)$ couples to the matter field $\psi(x)$ due to the term $-eA_\mu\bar\psi\gamma^\mu\psi$ in $\mathcal{L}_1$. We can identify this field as a representation of the photon particle. Since photons are massless, we do not add a mass term for $A_\mu$. In fact, were we to add such a term, $U(1)$ gauge invariance would be broken by it. However, we still need to add to the Lagrangian terms containing derivatives of $A_\mu$ in order to make it a true dynamical variable. The simplest gauge-invariant term satisfying this condition is

$$\mathcal{L}_A = -\frac{1}{4}F_{\mu\nu}F^{\mu\nu} \tag{2.12}$$

where

$$F_{\mu\nu} = \partial_\mu A_\nu - \partial_\nu A_\mu. \tag{2.13}$$
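
That this term is gauge invariant follows directly from the transformation (2.8). The following short check is a sketch that assumes only (2.8), (2.13) and the fact that partial derivatives commute:

```latex
\begin{align*}
F'_{\mu\nu}
  &= \partial_\mu A'_\nu - \partial_\nu A'_\mu \\
  &= \partial_\mu\!\left(A_\nu + \tfrac{1}{e}\,\partial_\nu\alpha\right)
   - \partial_\nu\!\left(A_\mu + \tfrac{1}{e}\,\partial_\mu\alpha\right) \\
  &= F_{\mu\nu} + \tfrac{1}{e}\left(\partial_\mu\partial_\nu - \partial_\nu\partial_\mu\right)\alpha
   = F_{\mu\nu}.
\end{align*}
```

Since $F_{\mu\nu}$ itself is unchanged, any Lorentz scalar built from it, in particular $F_{\mu\nu}F^{\mu\nu}$, is unchanged as well.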

Combining $\mathcal{L}_1$ and $\mathcal{L}_A$ leads to the Lagrangian of QED:

$$\mathcal{L}_{QED} = \bar\psi\,(i\gamma^\mu D_\mu - m)\,\psi - \frac{1}{4}F_{\mu\nu}F^{\mu\nu}. \tag{2.14}$$

We can also rewrite it as

$$\mathcal{L}_{QED} = \bar\psi\, i\gamma^\mu\partial_\mu\psi - m\bar\psi\psi - eA_\mu\bar\psi\gamma^\mu\psi - \frac{1}{4}F_{\mu\nu}F^{\mu\nu}. \tag{2.15}$$

The first term is the free matter field. The second term specifies that the particle is massive with mass $m$. The third term is the interaction between the matter field $\psi(x)$ and the gauge field $A_\mu(x)$. The parameter $e$ is the charge of the matter field and it is a measure of the strength of interaction between $\psi$ and $A_\mu$. This term is also called a current term since it can be written as $-eA_\mu j^\mu$ where $j^\mu = \bar\psi\gamma^\mu\psi$ is the current of the matter field. The fourth term is the free gauge field term. It can be interpreted as a free photon.

It  should  be  emphasized  that  there  is  no  mass  term  for  the  photon  field  which 
is why QED is $U(1)$ gauge invariant. The weak interaction, on the other hand,
requires  massive  gauge  fields  which  complicates  matters  significantly  as  will  be 
shown  later. 
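
To see why a photon mass term is excluded, consider the following sketch. The symbol $m_A$ for a would-be photon mass is introduced here only for illustration, and the gauge transformation is assumed to shift $A_\mu$ by $\frac{1}{e}\partial_\mu\alpha$:

```latex
\begin{align*}
\tfrac{1}{2} m_A^2 A_\mu A^\mu
  &\to \tfrac{1}{2} m_A^2
     \left(A_\mu + \tfrac{1}{e}\,\partial_\mu\alpha\right)
     \left(A^\mu + \tfrac{1}{e}\,\partial^\mu\alpha\right) \\
  &= \tfrac{1}{2} m_A^2 A_\mu A^\mu
   + \frac{m_A^2}{e}\, A_\mu\,\partial^\mu\alpha
   + \frac{m_A^2}{2e^2}\,\partial_\mu\alpha\,\partial^\mu\alpha .
\end{align*}
```

The two extra terms vanish for arbitrary $\alpha(x)$ only if $m_A = 0$, so local gauge invariance forces the gauge field to be massless.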

The  most  important  feature  of  this  derivation  is  the  fact  that  requiring  a global 
symmetry to become local forced us to introduce a new (gauge) field. This mechanism allows a natural way to introduce particles that “carry” the interaction
between  matter  fields.  The  same  principles  are  used  to  introduce  the  other  gauge 
particles  in  the  Standard  Model. 


2.2  Electroweak  Theory 

The  electroweak  theory  explains  the  weak  nuclear  force  while  at  the  same  time 
unifying  it  with  the  electromagnetic  force.  For  that  reason,  QED  is  contained 
in the electroweak theory. The gauge symmetry of the electroweak Lagrangian, $\mathcal{L}_{EW}$, is $SU(2) \times U(1)$, but the ground state does not have the same symmetry; i.e., $SU(2) \times U(1)$ is a spontaneously broken symmetry.




There are two essential differences between QED and the theory of electroweak interactions:

1. The number of required gauge particles for the weak interaction is 3. They are the $W^+$, the $W^-$ and the $Z$ vector bosons.

2. All three gauge bosons are massive.

The first difference is easy to deal with: we choose a group with more than one generator so that when we require local gauge invariance we get three gauge fields instead of one, as was the case in QED. The minimal group satisfying this condition is the $SU(2)$ group. Unlike the $U(1)$ group, $SU(2)$ is a non-abelian group, but this can be dealt with relatively easily. Yang and Mills were the first to apply local gauge invariance to non-abelian groups¹ and thus such theories are called Yang-Mills theories.

¹ More specifically, to the $SU(2)$ group.

The second difference, the existence of mass for the gauge bosons, is a very serious problem. As we pointed out in the previous section, a mass term for the photon field would violate the local $U(1)$ gauge invariance. This remains the case for $SU(2)$ as well and, as a consequence, we cannot follow the QED procedure exactly when constructing the electroweak Lagrangian. A naive “solution” would be to not require local gauge invariance and just write the mass terms for the three massive vector bosons. Unfortunately, such a theory has been shown to be non-renormalizable; i.e., it is meaningless since we cannot compute physical quantities.

The solution to this problem can be summarized as follows:


1. Construct a free-field Lagrangian, $\mathcal{L}_0$, for the leptons and the neutrinos which is $SU(2)_L \times U(1)_Y$ globally gauge invariant². This step is similar to the first step in constructing the QED Lagrangian when we considered a free massive charged particle. However, we will be forced to set the lepton masses to zero in $\mathcal{L}_0$ in order to have $SU(2)_L \times U(1)_Y$ local gauge invariance when we go to the second step. Needless to say, unless the leptons “acquire” mass at some later point in the procedure, the theory would be worthless since leptons are massive.

2. Introduce a covariant derivative in a fashion similar to the $U(1)$ case in QED and construct a new Lagrangian, $\mathcal{L}_1$, which is written in terms of the covariant derivative. Due to the covariant derivative, $\mathcal{L}_1$ contains four massless gauge fields: one of them comes from the $U(1)_Y$ subgroup and the other three come from the $SU(2)_L$ subgroup of $SU(2)_L \times U(1)_Y$. Then we require local $SU(2)_L \times U(1)_Y$ gauge invariance of $\mathcal{L}_1$ which gives us the transformation properties of all four gauge fields. It is at this stage that the masses of the leptons must be zero. In other words, the lepton mass terms $-m\bar\psi\psi$ would break local gauge invariance. As in the $U(1)$ case of QED, we cannot add mass terms for any of the four gauge fields since they too would break local gauge invariance.

3. Construct another locally $SU(2)_L \times U(1)_Y$ gauge invariant Lagrangian, $\mathcal{L}_2$, by adding to $\mathcal{L}_1$ a new scalar massive field, $\phi(x)$, which is a doublet under $SU(2)_L$ and has a potential $V(\phi)$ that makes the vacuum expectation value of $\phi$ non-zero and degenerate. As a consequence, the ground state of $\mathcal{L}_2$ is not $SU(2)_L \times U(1)_Y$ invariant, while at the same time the Lagrangian $\mathcal{L}_2$ is $SU(2)_L \times U(1)_Y$ invariant. In other words, we spontaneously break the $SU(2)_L \times U(1)_Y$ symmetry. The procedure of introducing the new field $\phi(x)$ is called the Higgs mechanism. The procedure of choosing a potential $V$ that results in the ground state not respecting the local symmetry of the Lagrangian is called spontaneous symmetry breaking.

   Due to the covariant derivative in the kinetic term of the field $\phi$ (which is required by local gauge invariance), the field interacts with the four (still massless) gauge fields. At a later stage in the procedure, this interaction will result in the redefinition of the four massless gauge fields into three massive and one massless gauge field. The massless field will then be identified with the photon field of QED. In addition, we will choose a potential $V(\phi)$ that results in self-interaction of $\phi$.

4. Construct yet another locally $SU(2)_L \times U(1)_Y$ gauge invariant Lagrangian, $\mathcal{L}_3 = \mathcal{L}_2 + \mathcal{L}_{Yuk}$, where $\mathcal{L}_{Yuk}$ contains terms coupling the field $\phi$ to the leptons. These terms are called Yukawa couplings and they are responsible for making the electron, the muon and the tau massive while at the same time preserving the “masslessness” of the neutrinos.

5. Finally, choose a specific gauge, called the unitary gauge, that makes one of the components of $\phi$ zero and “shift” the other component by its vacuum expectation value, thereby obtaining a new field $h(x)$. Rewriting $\mathcal{L}_3$ in terms of $h(x)$ yields the Lagrangian of electroweak interactions for leptons, $\mathcal{L}_{EW}$. In this Lagrangian, there are three massive gauge fields corresponding to the $W^+$, $W^-$ and the $Z$ bosons and one massless field corresponding to the photon. The electron, the muon and the tau are all massive in this Lagrangian and all neutrinos are massless.

   In addition, the field $h(x)$, called Higgs, is a new massive scalar field which must exist in nature if the Standard Model is an adequate theory.³ The Higgs boson couples to the three massive leptons, the three massive gauge fields and to itself.

6. Extend the Lagrangian for leptons to include quarks as well by repeating the first five steps. The significant difference between leptons and quarks is that all quarks are massive. As a consequence, the Yukawa couplings for quarks will not give fields of well-defined mass. After transforming these fields so that the mass matrix is diagonal, the Lagrangian can be written in terms of the physical quark terms. This transformation will cause the charged currents to contain the flavor-mixing CKM matrix.

² The meaning of the group subscripts will be clarified later.

We will now apply the procedure outlined above and give the details of constructing the electroweak Lagrangian, $\mathcal{L}_{EW}$.
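
The idea behind the third step, a potential whose minimum lies away from $\phi = 0$ and is degenerate, can be illustrated numerically. The Python sketch below is only a toy model: it assumes the commonly used quartic form $V = -\mu^2|\phi|^2 + \lambda|\phi|^4$, treats $\phi$ as a single complex number rather than an $SU(2)_L$ doublet, and uses arbitrary values for `MU_SQUARED` and `LAMBDA`. None of these choices is taken from the text above.

```python
import cmath
import math

MU_SQUARED = 1.0  # hypothetical value of mu^2 > 0
LAMBDA = 0.25     # hypothetical quartic coupling lambda > 0


def potential(phi):
    """Assumed quartic potential V = -mu^2 |phi|^2 + lambda |phi|^4."""
    r2 = abs(phi) ** 2
    return -MU_SQUARED * r2 + LAMBDA * r2 ** 2


def radial_minimum(lo=0.0, hi=3.0, n=200001):
    """Locate the minimum of V along the real axis by a brute-force grid scan."""
    step = (hi - lo) / (n - 1)
    return min((lo + i * step for i in range(n)), key=potential)


def vacuum_energies(radius, n_phases=8):
    """Evaluate V at several phases on the circle |phi| = radius."""
    return [potential(radius * cmath.exp(2j * math.pi * k / n_phases))
            for k in range(n_phases)]


if __name__ == "__main__":
    v = radial_minimum()
    print("numerical |phi| at the minimum:", v)
    print("analytic sqrt(mu^2 / (2 lambda)):", math.sqrt(MU_SQUARED / (2.0 * LAMBDA)))
    print("V(0) =", potential(0.0), " V(minimum) =", potential(v))
    print("V around the vacuum circle:", vacuum_energies(v))
```

In this toy model the lowest energy is reached at a non-zero $|\phi|$, and every phase on that circle has the same energy. The potential is symmetric, but any particular choice of ground state is not, which is the essence of a spontaneously broken symmetry.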

2.2.1  The  Free-Field  Lagrangian 

Our objective is to first write the free-field Lagrangian, $\mathcal{L}_0$, which describes free leptons. To simplify the notation, we will only consider the first generation of leptons, the electron, $e$, and its neutrino, $\nu_e$, since the procedure is trivially generalizable to all three generations.

³ Until the Higgs boson is discovered experimentally, the Standard Model is, strictly speaking, a hypothesis. The Higgs mechanism (which introduced the Higgs particle) is one specific way to introduce masses to gauge fields and leptons. There may be another mechanism which is equally satisfactory from a phenomenological point of view. Within the next decade, we will probably know whether Higgs exists or not. In the latter case, the Standard Model would have to be discarded in its current form.

The  essential  differences  from  the  free-field  Lagrangian  of  QED  are  as  follows: 

1.  The  neutrinos  are  left-handed,  while  the  massive  leptons  have  both  left- 
handed  and  right-handed  components. 

2. We cannot put a mass term for the electron because it will break local $SU(2)_L \times U(1)_Y$ gauge invariance.

To account for the first difference, we define the $SU(2)$ doublet

$$L(x) = \begin{pmatrix} \nu_{eL}(x) \\ e_L(x) \end{pmatrix} \tag{2.16}$$

(which is also called an isospin doublet) that consists of the left-handed components of the electron and its neutrino. Since the electron has a right-handed component as well, we also define the $SU(2)$ singlet

$$R(x) = e_R(x) \tag{2.17}$$

that is simply the right-handed component of the electron. The relationship between $e_L(x)$, $e_R(x)$ and $e(x)$ is given by

$$e_L(x) = \tfrac{1}{2}(1 - \gamma_5)\,e(x) \tag{2.18}$$

$$e_R(x) = \tfrac{1}{2}(1 + \gamma_5)\,e(x). \tag{2.19}$$
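As a quick numerical illustration of Eqs. (2.18) and (2.19), the following sketch builds the projectors $\tfrac{1}{2}(1 \mp \gamma_5)$ and splits an arbitrary spinor into its chiral components. The Dirac representation of $\gamma_5$, the helper name `split_chiral` and the test spinor are assumptions made for illustration only; none of them is fixed by the text.

```python
import numpy as np

# gamma_5 in the Dirac (standard) representation.
# ASSUMPTION: the text does not commit to a representation.
I2 = np.eye(2)
Z2 = np.zeros((2, 2))
gamma5 = np.block([[Z2, I2], [I2, Z2]])

P_L = 0.5 * (np.eye(4) - gamma5)  # projector of Eq. (2.18)
P_R = 0.5 * (np.eye(4) + gamma5)  # projector of Eq. (2.19)


def split_chiral(e):
    """Return the left- and right-handed components of a 4-spinor."""
    return P_L @ e, P_R @ e


# Arbitrary test spinor (hypothetical values).
e = np.array([1.0 + 2.0j, -0.5j, 0.3, 2.0])
e_L, e_R = split_chiral(e)
```

The two components add back up to the original spinor, and each projector annihilates the component of opposite chirality, which is what allows $L(x)$ and $R(x)$ to be treated as independent fields.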


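We can now write the free-field Lagrangian in terms of $L(x)$ and $R(x)$ as follows

$$\mathcal{L}_0 = \bar{L}(x)\,(i\gamma^\mu\partial_\mu)\,L(x) + \bar{R}(x)\,(i\gamma^\mu\partial_\mu)\,R(x) \tag{2.20}$$

which can also be written explicitly as

$$\mathcal{L}_0 = \left(\bar{\nu}_{eL}(x),\ \bar{e}_L(x)\right)(i\gamma^\mu\partial_\mu)\begin{pmatrix} \nu_{eL}(x) \\ e_L(x) \end{pmatrix} + \bar{e}_R(x)\,i\gamma^\mu\partial_\mu\,e_R(x). \tag{2.21}$$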


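The intrinsic global symmetry of this Lagrangian is the group $SU(2)_L \times U(1)_Y$. Since it is a direct product of two groups, we can show this symmetry by considering $SU(2)$ and $U(1)$ separately.

The transformations corresponding to the $SU(2)$ group are given by

$$\mathcal{U} = \exp\left(\tfrac{i}{2}\,\boldsymbol{\tau}\cdot\boldsymbol{\alpha}\right) \tag{2.22}$$

where $\boldsymbol{\tau}$ are the Pauli matrices. The matrices $\mathcal{U} \in SU(2)$ obey the following relations:

$$\mathcal{U}\,\mathcal{U}^\dagger = 1, \tag{2.23}$$

$$\det\mathcal{U} = 1. \tag{2.24}$$

The isospin doublet, $L(x)$, transforms under $SU(2)$ as

$$L(x) \rightarrow \mathcal{U}\,L(x) \tag{2.25}$$

$$\bar{L}(x) \rightarrow \bar{L}(x)\,\mathcal{U}^\dagger \tag{2.26}$$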
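The properties (2.23) and (2.24), and the invariance of a bilinear of the form $\bar{L}L$, can be checked numerically. The sketch below is only an illustration: it assumes the normalization $\exp(\tfrac{i}{2}\boldsymbol{\tau}\cdot\boldsymbol{\alpha})$, uses the closed form of the exponential instead of a matrix-exponential routine, and represents the doublet at one space-time point by two complex numbers with spinor indices suppressed. The function name `su2_element` and all numerical values are hypothetical.

```python
import numpy as np

# Pauli matrices tau_1, tau_2, tau_3.
tau = np.array([
    [[0, 1], [1, 0]],
    [[0, -1j], [1j, 0]],
    [[1, 0], [0, -1]],
], dtype=complex)


def su2_element(alpha):
    """Return exp(i/2 tau.alpha) via cos(|a|/2) + i sin(|a|/2) n.tau."""
    alpha = np.asarray(alpha, dtype=float)
    norm = np.linalg.norm(alpha)
    if norm == 0.0:
        return np.eye(2, dtype=complex)
    n_dot_tau = np.einsum("a,aij->ij", alpha / norm, tau)
    return np.cos(norm / 2) * np.eye(2) + 1j * np.sin(norm / 2) * n_dot_tau


U = su2_element([0.3, -1.1, 0.7])          # arbitrary global parameters
L = np.array([0.2 + 0.5j, -1.0 + 0.1j])    # stand-in for (nu_eL, e_L)

norm_before = np.vdot(L, L)                # L^dagger L
norm_after = np.vdot(U @ L, U @ L)         # (U L)^dagger (U L)
```

Because $\mathcal{U}$ does not depend on $x$ for a global transformation, it commutes with $\partial_\mu$, so the same cancellation $\mathcal{U}^\dagger\mathcal{U} = 1$ applies to the kinetic term of $\mathcal{L}_0$.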


and the isospin singlet transforms as

$$R(x) \rightarrow R(x) \tag{2.27}$$

$$\bar{R}(x) \rightarrow \bar{R}(x). \tag{2.28}$$

From the equations above it is clear that $\mathcal{L}_0$ is globally $SU(2)$ invariant. Since only the left-handed fields are affected non-trivially by $SU(2)$, this group is labeled $SU(2)_L$ in the Standard Model.

To show $U(1)$ invariance, we first note that both the left-handed and right-handed fields are separately invariant under global $U(1)$ transformations. However, they must be considered together to make sure that their contributions to the terms in the Lagrangian (including later Lagrangians) are the same. If they were not, the phase transformation on the left-handed field would simply be a particular case of the $SU(2)_L$ transformation.

For that reason, we form a triplet, $\psi(x)$, of all fermions as follows

$$\psi(x) = \begin{pmatrix} \nu_{eL}(x) \\ e_L(x) \\ e_R(x) \end{pmatrix} \tag{2.29}$$

and consider $U(1)$ transformations of the form

$$\psi'(x) = e^{i\chi Y}\,\psi(x) \tag{2.30}$$


where

$$Y = \begin{pmatrix} y_L & 0 & 0 \\ 0 & y_L & 0 \\ 0 & 0 & y_R \end{pmatrix} \tag{2.31}$$

and $y_L$ and $y_R$ are some fixed numbers. One of these numbers can always be chosen arbitrarily since it would only shift the phase of the other. We chose

$$y_R = -1. \tag{2.32}$$

In other words, we are considering the pair of $U(1)$ transformations

$$L(x) \rightarrow e^{i y_L \chi}\,L(x) \tag{2.33}$$

$$R(x) \rightarrow e^{i y_R \chi}\,R(x) \tag{2.34}$$

such that eventually the number $y_L$ will be chosen to make the final Lagrangian $U(1)$ symmetric^. In operator form, we can write the $U(1)$ transformation as

$$U_Y = e^{i\chi Y} \tag{2.35}$$

where $\chi$ is a fixed (global) phase.

Since $\mathcal{L}_0$ is globally $U(1)$ symmetric, there is a conserved charge. Due to Eq. 2.31, however, the charge of $L(x)$ is different from the charge of $R(x)$. On the other hand, the electron has one of its components in $L(x)$ and the other in $R(x)$.

^This can always be done since the value of $y_L$ is completely arbitrary at this point. Later it will be multiplying various terms in the Lagrangian. It is only at that point that its value can be chosen.


For that reason, the $U(1)$ group in $SU(2)_L \times U(1)_Y$ cannot possibly be the $U(1)$ group of QED (which we will denote as $U(1)_{QED}$).

The conserved charge corresponding to the $U(1)$ transformations above is called hypercharge and denoted by $Y$, hence the subscript of $U(1)$ in $SU(2)_L \times U(1)_Y$. It is related to the charge, $Q$, of QED as follows

$$Q = I_3 + Y \tag{2.36}$$

where $I_3$ is the third component of the weak isospin generator of $SU(2)_L$. As can be seen from the above equation, the charge $Q$, which is the generator of $U(1)_{QED}$, is related to both the $U(1)_Y$ and the $SU(2)_L$ group. As will become clear later, the QED gauge group $U(1)_{QED}$ is a subgroup of $SU(2)_L \times U(1)_Y$. This is the reason why weak interactions cannot be described separately from electromagnetic interactions. In particular, had we postulated from the very beginning that the Lagrangian for weak interactions be $SU(2)$ gauge invariant only (vs. $SU(2)_L \times U(1)_Y$), at later stages we would find it impossible to construct the weak theory or to separately introduce QED. In the first case, we would not be able to obtain massive gauge bosons, for example. In the second case, the lepton mass terms in QED would break the $SU(2)$ invariance of the weak interaction.

2.2.2  Introduction  of  the  Massless  Gauge  Fields 

The next step in the procedure is to require local $SU(2)_L \times U(1)_Y$ gauge invariance in order to introduce the gauge fields in a “natural” way. In other words, we now require that the $SU(2)_L$ transformation be

$$U_L(x) = \exp\left(\tfrac{i}{2}\,\boldsymbol{\tau}\cdot\boldsymbol{\alpha}(x)\right), \tag{2.37}$$

and that the $U(1)_Y$ transformation be

$$U_Y(x) = e^{i\chi(x)\,Y}. \tag{2.38}$$

As was the case in QED, $\mathcal{L}_0$ is not locally gauge invariant due to the derivatives. But if we replace them with covariant derivatives, then we can obtain a new Lagrangian, $\mathcal{L}_1$, which is locally invariant, i.e.

$$\mathcal{L}_1 = \bar{L}(x)\,(i\gamma^\mu D_\mu)\,L(x) + \bar{R}(x)\,(i\gamma^\mu D_\mu)\,R(x). \tag{2.39}$$

Since $SU(2)_L \times U(1)_Y$ is a product of two groups, the covariant derivative has the following structure:

$$D_\mu = \partial_\mu + G^L_\mu + G^Y_\mu \tag{2.40}$$

where $G^L_\mu$ is the gauge term from the $SU(2)_L$ group and $G^Y_\mu$ is the gauge term from the $U(1)_Y$ group. The term $G^Y_\mu$ has exactly the same structure as in QED since it originates from the $U(1)_Y$ group. However, since the corresponding gauge field is not the photon field, as already discussed, we give it a different notation, namely $B_\mu(x)$. In this notation, $G^Y_\mu$ is

$$G^Y_\mu = i g' B_\mu(x)\,Y \tag{2.41}$$

where $g'$ is a coupling constant analogous to the electron charge in QED and $Y$ is the $3 \times 3$ matrix introduced earlier.

The term $G^L_\mu$ contains three gauge fields which we denote by $W^a_\mu$ with $a = 1, 2, 3$. Its form is

$$G^L_\mu = i g\,W^a_\mu(x)\,T_a \tag{2.42}$$

where $g$ is another coupling constant and the three matrices $T_a$ are given by

$$T_a = \begin{pmatrix} \tfrac{1}{2}\tau_a & 0 \\ 0 & 0 \end{pmatrix}. \tag{2.43}$$

The $Y$ and $T_a$ matrices form a representation of the generators of the group $SU(2)_L \times U(1)_Y$. They obey the following commutation rules:

$$[T_a, T_b] = i\epsilon_{abc}\,T_c, \tag{2.44}$$

$$[T_a, Y] = 0 \tag{2.45}$$

where $\epsilon_{abc}$ are the structure constants of the $SU(2)_L$ group.
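The commutation rules (2.44) and (2.45) can be verified directly for a $3 \times 3$ representation acting on the triplet $\psi = (\nu_{eL}, e_L, e_R)$. The sketch below assumes that $T_a$ is $\tfrac{1}{2}\tau_a$ on the doublet and zero on the singlet, and uses the hypercharge values $y_L = -\tfrac{1}{2}$, $y_R = -1$; the helper names (`embed`, `su2_violation`, `u1_violation`) are hypothetical.

```python
import numpy as np

# Pauli matrices tau_1, tau_2, tau_3.
tau = np.array([
    [[0, 1], [1, 0]],
    [[0, -1j], [1j, 0]],
    [[1, 0], [0, -1]],
], dtype=complex)


def embed(t):
    """Put tau/2 in the upper-left 2x2 block of a 3x3 matrix."""
    T = np.zeros((3, 3), dtype=complex)
    T[:2, :2] = t / 2.0
    return T


# ASSUMPTION: block-diagonal form, tau_a/2 on (nu_eL, e_L), 0 on e_R.
T = [embed(tau[a]) for a in range(3)]

# Hypercharge matrix of Eq. (2.31) with y_L = -1/2 and y_R = -1.
Y = np.diag([-0.5, -0.5, -1.0]).astype(complex)


def comm(A, B):
    return A @ B - B @ A


def eps(a, b, c):
    """Levi-Civita symbol for indices taken from {0, 1, 2}."""
    return (a - b) * (b - c) * (c - a) / 2


def su2_violation():
    """Largest entry of [T_a, T_b] - i eps_abc T_c over all a, b."""
    worst = 0.0
    for a in range(3):
        for b in range(3):
            rhs = sum(1j * eps(a, b, c) * T[c] for c in range(3))
            worst = max(worst, np.abs(comm(T[a], T[b]) - rhs).max())
    return worst


def u1_violation():
    """Largest entry of [T_a, Y] over all a."""
    return max(np.abs(comm(T[a], Y)).max() for a in range(3))
```

The second check succeeds because $Y$ is proportional to the identity on the doublet block, which is why both members of the doublet must carry the same hypercharge $y_L$.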

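When we substitute the above gauge terms in the covariant derivative, we obtain

$$D_\mu = \partial_\mu + i g\,W^a_\mu(x)\,T_a + i g' B_\mu(x)\,Y. \tag{2.46}$$

As was the case in QED, we now have to add the kinetic terms corresponding to these four new gauge fields in order to make them dynamical variables of the theory. Their minimal form consistent with gauge invariance is

$$-\tfrac{1}{4}B_{\mu\nu}(x)B^{\mu\nu}(x)\ \ \text{for the } B \text{ field}, \qquad -\tfrac{1}{4}\mathrm{Tr}\left(W^a_{\mu\nu}(x)W_a^{\mu\nu}(x)\right)\ \ \text{for the } W \text{ fields} \tag{2.47}$$

where

$$B_{\mu\nu} = \partial_\mu B_\nu - \partial_\nu B_\mu \tag{2.48}$$

and

$$W^a_{\mu\nu} = \partial_\mu W^a_\nu - \partial_\nu W^a_\mu - g\,\epsilon_{abc}W^b_\mu W^c_\nu. \tag{2.49}$$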

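When we add the above kinetic terms to the already gauge invariant Lagrangian $\mathcal{L}_1$, we obtain its final form:

$$\mathcal{L}_1 = \bar{L}(x)\,(i\gamma^\mu D_\mu)\,L(x) + \bar{R}(x)\,(i\gamma^\mu D_\mu)\,R(x) \tag{2.50}$$

$$\qquad - \tfrac{1}{4}B_{\mu\nu}(x)B^{\mu\nu}(x) - \tfrac{1}{4}\mathrm{Tr}\left(W^a_{\mu\nu}(x)W_a^{\mu\nu}(x)\right). \tag{2.51}$$

If we substitute the covariant derivative, we obtain a more explicit form:

$$\mathcal{L}_1 = \bar{L}(x)\left(i\gamma^\mu\partial_\mu + i g W^a_\mu(x)T_a + i g' B_\mu(x)Y\right)L(x) \tag{2.52}$$

$$\qquad + \bar{R}(x)\left(i\gamma^\mu\partial_\mu + i g W^a_\mu(x)T_a + i g' B_\mu(x)Y\right)R(x) \tag{2.53}$$

$$\qquad - \tfrac{1}{4}B_{\mu\nu}(x)B^{\mu\nu}(x) - \tfrac{1}{4}\mathrm{Tr}\left(W^a_{\mu\nu}(x)W_a^{\mu\nu}(x)\right). \tag{2.54}$$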


The problems with this Lagrangian are that it describes massless leptons only and that the $W$ gauge fields that came out of $SU(2)_L$ are massless as well. If it were not for these two problems, the Standard Model would have been much simpler to construct since this would be its final Lagrangian. However, we know experimentally that some of the leptons are massive and that the gauge particles for the weak interaction are massive as well. The recovery of the lepton masses (in the first case) and the introduction of masses for the weak gauge particles (in the second) are the essence of the Standard Model.

2.2.3  Spontaneous  Breaking  of  the  Symmetry 
via  the  Higgs  Mechanism 

We cannot “manually” introduce mass terms in $\mathcal{L}_1$ for the leptons or for the $W$ bosons for at least the following reasons:

1.  The mass terms would break the local $SU(2)_L \times U(1)_Y$ invariance.

2.  Even if they did not break the gauge invariance, we would still not have a photon field since the $U(1)_Y$ gauge field $B$ is not the photon field.

The solution of this problem in the Standard Model is to manually introduce yet another field, $\phi(x)$, to $\mathcal{L}_1$ which is a massive scalar field that respects the $SU(2)_L \times U(1)_Y$ local symmetry. This field should break the $SU(2)_L \times U(1)_Y$ invariance of the vacuum state since only in that case can we form massive gauge bosons as well as recover the masses of the leptons.

The  Higgs  Mechanism 

The introduction of such a field is called the Higgs mechanism. Let the new field be denoted by $\phi(x)$. Then we can form the new Lagrangian, $\mathcal{L}_2$, by adding the $\phi(x)$ field terms to $\mathcal{L}_1$ as follows:

$$\mathcal{L}_2 = \mathcal{L}_1 + \mathcal{L}_\phi \tag{2.55}$$

where $\mathcal{L}_\phi$ contains the terms relevant to the field $\phi(x)$. The most general form of $\mathcal{L}_\phi$ is given by

$$\mathcal{L}_\phi = (D^\mu\phi)^\dagger(D_\mu\phi) - V(\phi) + \mu^2\phi^\dagger\phi. \tag{2.56}$$

The first term is the kinetic energy of the field. The second term is the potential energy of the field and the third term specifies that the field $\phi(x)$ is massive. At this point, we must require that $\mu^2 < 0$ in order to obtain any non-trivial consequences. This requirement by itself makes the field $\phi$ a non-physical field since we do not know what an imaginary mass means. We will show later, however, that the resulting Lagrangian can be rewritten in terms of physical fields only and that it is $SU(2)_L \times U(1)_Y$ locally invariant.

To conclude the introduction of the $\phi$ field into $\mathcal{L}_2$, we must specify the form of the potential $V(\phi)$. The simplest choice is to require that

$$V(\phi) = |\lambda|\,(\phi^\dagger\phi)^2. \tag{2.57}$$

In that case, the modified Lagrangian, $\mathcal{L}_2$, becomes

$$\mathcal{L}_2 = \bar{L}(x)\,(i\gamma^\mu D_\mu)\,L(x) + \bar{R}(x)\,(i\gamma^\mu D_\mu)\,R(x) \tag{2.58}$$

$$\qquad - \tfrac{1}{4}B_{\mu\nu}(x)B^{\mu\nu}(x) - \tfrac{1}{4}\mathrm{Tr}\left(W^a_{\mu\nu}(x)W_a^{\mu\nu}(x)\right) \tag{2.59}$$

$$\qquad + (D^\mu\phi)^\dagger(D_\mu\phi) - |\lambda|\,(\phi^\dagger\phi)^2 + \mu^2\phi^\dagger\phi. \tag{2.60}$$


Table 2.1: The quantum numbers of the $\phi$ field.

                 T      T_3      Y      Q
    phi_1(x)    1/2     1/2     1/2     1
    phi_2(x)    1/2    -1/2     1/2     0

We now have to choose whether the field $\phi$ is a singlet, doublet, etc. under the weak isospin group $SU(2)_L$. The simplest choice that eventually gives the desired final results is for $\phi$ to be a doublet of two complex fields $\phi_1$ and $\phi_2$, i.e.

$$\phi(x) = \begin{pmatrix} \phi_1(x) \\ \phi_2(x) \end{pmatrix}. \tag{2.61}$$

The quantum numbers of $\phi$ are shown in Table 2.1.


Spontaneous  Symmetry  Breaking 

Due to the negative value of $\mu^2$ in the mass term of $\mathcal{L}_\phi$, the vacuum expectation value of the field $\phi$ is not zero! It is given by

$$\langle\phi\rangle^2 = -\mu^2/|\lambda| = v \tag{2.62}$$

where we have denoted with $v$ the square of the vacuum expectation value of $\phi$. In other words, the ground state does not have the same symmetry as the Lagrangian. The introduction of the $\phi$ field resulted in a spontaneous symmetry

breaking of the “original” $SU(2)_L \times U(1)_Y$ symmetry^. However, the ground state is still invariant under the subgroup $U(1)_{QED}$ generated by $Q = T_3 + Y$. This will cause the electromagnetic field $A$ to remain massless in the final Lagrangian. We can summarize the spontaneous symmetry breaking as

$$SU(2)_L \times U(1)_Y \rightarrow U(1)_{QED} \tag{2.63}$$

to emphasize that the ground state is $U(1)_{QED}$ invariant.
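The residual $U(1)_{QED}$ invariance can be made concrete with a small numerical check. The sketch below assumes the quantum numbers of Table 2.1 (so that the charge operator on the doublet is $T_3 + Y$ with $Y = \tfrac{1}{2}$) and a vacuum configuration whose only non-zero entry is the lower, electrically neutral component; the value of `v` and the helper name `annihilates` are hypothetical.

```python
import numpy as np

# Generators acting on the scalar doublet (phi_1, phi_2).
T1 = 0.5 * np.array([[0.0, 1.0], [1.0, 0.0]])
T3 = np.diag([0.5, -0.5])
Y = 0.5 * np.eye(2)        # hypercharge of phi from Table 2.1
Q = T3 + Y                 # charge operator; diag(1, 0)

v = 1.0                    # ASSUMPTION: arbitrary units, illustrative value
phi0 = np.array([0.0, v / np.sqrt(2)])   # neutral component acquires the VEV


def annihilates(generator, state, tol=1e-12):
    """True if the generator leaves the state invariant (G|state> = 0)."""
    return bool(np.abs(generator @ state).max() < tol)


unbroken = annihilates(Q, phi0)
broken = [not annihilates(G, phi0) for G in (T1, T3, Y)]
```

A generator that annihilates the vacuum exponentiates to a transformation that leaves it unchanged, so `Q` generates the surviving symmetry while `T1`, `T3` and `Y` taken individually do not.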

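Furthermore, the ground state, $\phi_0$, is degenerate since it can have an arbitrary orientation in isospin space. This follows from the fact that we can always parameterize it as follows

$$\phi_0(\boldsymbol{\theta}) = e^{i(\boldsymbol{\tau}/2)\cdot\boldsymbol{\theta}}\begin{pmatrix} 0 \\ v/\sqrt{2} \end{pmatrix} \tag{2.64}$$

where $\boldsymbol{\theta}$ is a vector in isospin space. Each one of these equivalent ground states is not invariant under $SU(2)_L$ including, for example, the configuration

$$\phi_0 = \begin{pmatrix} 0 \\ v/\sqrt{2} \end{pmatrix}. \tag{2.65}$$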


^Let us emphasize that when we are talking about “spontaneous breaking” of a symmetry, we are comparing two different Lagrangians. The original Lagrangian, $\mathcal{L}_0$, is not even locally gauge invariant. The second Lagrangian, $\mathcal{L}_1$, is gauge invariant and the ground state exhibits the same symmetry. The third Lagrangian we introduced, $\mathcal{L}_2$ (after adding the non-physical field $\phi$), is still $SU(2)_L \times U(1)_Y$ gauge invariant, but its ground state is not. In other words, spontaneous symmetry breaking refers to the procedure of comparing two different Lagrangians in which the first one has the same symmetry for its ground state as the Lagrangian itself, and the second one has the same Lagrangian symmetry as the first Lagrangian, but its ground state is “less” symmetric.


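We conclude the introduction of the $\phi$ field by substituting the covariant derivatives in $\mathcal{L}_2$ to show the explicit form of our latest Lagrangian:

$$\mathcal{L}_2 = \bar{L}(x)\left(i\gamma^\mu\partial_\mu + i g W^a_\mu(x)T_a + i g' B_\mu(x)Y\right)L(x) \tag{2.66}$$

$$\qquad + \bar{R}(x)\left(i\gamma^\mu\partial_\mu + i g W^a_\mu(x)T_a + i g' B_\mu(x)Y\right)R(x) \tag{2.67}$$

$$\qquad - \tfrac{1}{4}B_{\mu\nu}(x)B^{\mu\nu}(x) - \tfrac{1}{4}\mathrm{Tr}\left(W^a_{\mu\nu}(x)W_a^{\mu\nu}(x)\right) \tag{2.68}$$

$$\qquad + \left((\partial^\mu + i g W^{a\mu}(x)T_a + i g' B^\mu(x)Y)\phi\right)^\dagger\left((\partial_\mu + i g W^a_\mu(x)T_a + i g' B_\mu(x)Y)\phi\right) \tag{2.69}$$

$$\qquad - |\lambda|\,(\phi^\dagger\phi)^2 + \mu^2\phi^\dagger\phi. \tag{2.70}$$

As can be seen from this explicit form, the $\phi$ field interacts with all four gauge bosons due to the covariant derivative in its kinetic term. This interaction will result in “emitting” mass terms for three of the four gauge bosons, thereby solving the problem with massive gauge particles. The Lagrangian cannot, however, introduce masses for the leptons since there is no interaction between the $\phi$ field and the $L$ and $R$ leptonic fields.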

2.2.4  Coupling of $\phi$ to the Leptons

Let us recall that when we required local $SU(2)_L \times U(1)_Y$ gauge invariance, we encountered two problems:

1.  The gauge fields cannot be massive since the mass terms break the gauge invariance.

2.  The matter fields cannot be massive either since their mass terms break the invariance as well.


As already mentioned above, our latest Lagrangian is already “populated” with terms that will solve the first problem, giving mass to three of the gauge bosons. The second problem, however, is not yet addressed.

The procedure for recovering the lepton masses consists of the manual introduction of yet another term in the Lagrangian that couples the $\phi$ field to the leptons. These interactions will then recover the lepton masses in a way similar to that of the gauge bosons.

In other words, we extend our latest Lagrangian, $\mathcal{L}_2$, to

$$\mathcal{L}_3 = \mathcal{L}_2 + \mathcal{L}_{Yuk} \tag{2.71}$$

where $\mathcal{L}_{Yuk}$ contains the terms that couple $\phi$ to the matter fields. This portion of the Lagrangian is referred to as the Yukawa sector and the terms that it contains are called Yukawa terms.

The simplest form of such interactions that is consistent with the $SU(2)_L \times U(1)_Y$ gauge invariance is

$$\mathcal{L}_{Yuk} = -c_e\left(\bar{R}\,\phi^\dagger L + \bar{L}\,\phi\,R\right) \tag{2.72}$$

where $c_e$ is a parameter of the theory. Recall that when we considered $U(1)_Y$ gauge invariance there were two fixed numbers, $y_L$ and $y_R$. We chose $y_R = -1$, but left the other number undetermined. It is at this point that we can choose the right value for $y_L$. It can be shown that demanding $U(1)_Y$ gauge invariance of $\mathcal{L}_{Yuk}$ implies

$$y_L = -\tfrac{1}{2}. \tag{2.73}$$
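The hypercharge bookkeeping behind this choice is short enough to spell out. Under a $U(1)_Y$ transformation the term $\bar{L}\phi R$ picks up the phase $e^{i(-y_L + y_\phi + y_R)\chi}$, so invariance requires the exponent to vanish. The sketch below assumes $y_R = -1$ as chosen earlier and $y_\phi = \tfrac{1}{2}$ as listed in Table 2.1; the function names are hypothetical.

```python
from fractions import Fraction


def net_hypercharge(y_L, y_phi, y_R):
    """Total hypercharge of L-bar phi R; the barred field counts as -y_L."""
    return -y_L + y_phi + y_R


def solve_y_L(y_phi, y_R):
    """Value of y_L that makes the Yukawa term U(1)_Y invariant."""
    return y_phi + y_R


y_R = Fraction(-1)        # choice made for the right-handed electron
y_phi = Fraction(1, 2)    # hypercharge of phi, read from Table 2.1
y_L = solve_y_L(y_phi, y_R)
residual = net_hypercharge(y_L, y_phi, y_R)
```

The hermitian-conjugate term $\bar{R}\phi^\dagger L$ carries the opposite total hypercharge, so it is invariant under the same condition.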


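At this stage, the full form of our latest Lagrangian, $\mathcal{L}_3$, is

$$\mathcal{L}_3 = \bar{L}(x)\left(i\gamma^\mu\partial_\mu + i g W^a_\mu(x)T_a + i g' B_\mu(x)Y\right)L(x) \tag{2.74}$$

$$\qquad + \bar{R}(x)\left(i\gamma^\mu\partial_\mu + i g W^a_\mu(x)T_a + i g' B_\mu(x)Y\right)R(x) \tag{2.75}$$

$$\qquad - \tfrac{1}{4}B_{\mu\nu}(x)B^{\mu\nu}(x) - \tfrac{1}{4}\mathrm{Tr}\left(W^a_{\mu\nu}(x)W_a^{\mu\nu}(x)\right) \tag{2.76}$$

$$\qquad + \left((\partial^\mu + i g W^{a\mu}(x)T_a + i g' B^\mu(x)Y)\phi\right)^\dagger\left((\partial_\mu + i g W^a_\mu(x)T_a + i g' B_\mu(x)Y)\phi\right) \tag{2.77}$$

$$\qquad - |\lambda|\,(\phi^\dagger\phi)^2 + \mu^2\phi^\dagger\phi \tag{2.78}$$

$$\qquad - c_e\left(\bar{R}\,\phi^\dagger L + \bar{L}\,\phi\,R\right). \tag{2.79}$$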


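The transformation properties of all fields are as follows:

• $SU(2)$ gauge transformations

• $U(1)$ gauge transformations

$$W^a_\mu(x) \rightarrow W^a_\mu(x),$$

$$B_\mu(x) \rightarrow B_\mu(x) - \frac{1}{g'}\,\partial_\mu\chi(x),$$

$$L(x) \rightarrow e^{i y_L\chi(x)}\,L(x),$$

$$R(x) \rightarrow e^{i y_R\chi(x)}\,R(x) \tag{2.81}$$

where $U(x) = \exp\left(\tfrac{i}{2}\,\boldsymbol{\tau}\cdot\boldsymbol{\alpha}(x)\right) \in SU(2)$ and $a = 1, 2, 3$.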

In principle, we could call $\mathcal{L}_3$ the Lagrangian of the electroweak interactions for leptons since we will not be adding any more terms. However, it is expressed in terms of the non-physical field $\phi$ and it is not obvious at all that it contains three massive gauge bosons, one massless gauge boson and that the leptons do in fact have masses. All of the above statements are true, but not visible. After proper manipulation of $\mathcal{L}_3$ we will rewrite the Lagrangian in a different form which contains physical fields only. This will be the Lagrangian of the electroweak interactions for leptons.


2.2.5  Fixing  the  Gauge  and  Obtaining  the  Electroweak  Lagrangian  for 
Leptons 

Let us recall that the $\phi$ field cannot be a physical field since its mass is imaginary. In this final step of the procedure, we will rewrite $\phi$ so that it contains one physical field. When we substitute for $\phi$ in our latest Lagrangian, $\mathcal{L}_3$, we will have achieved our objective of constructing the electroweak Lagrangian, $\mathcal{L}^l_{EW}$, for leptons. The superscript, $l$, indicates that we have considered leptons only.


We begin by choosing a gauge that simplifies the structure of the $\phi$ isospin doublet. It can be shown that we can always choose a gauge $U_L$ in such a way that

$$U_L(x)\,\phi(x) = \begin{pmatrix} 0 \\ \rho(x)/\sqrt{2} \end{pmatrix}. \tag{2.82}$$

This choice of gauge is called unitary gauge. The new field, $\rho(x)$, is real and satisfies

$$\rho(x) > 0. \tag{2.83}$$

Its vacuum expectation value is

$$\langle 0|\rho(x)|0\rangle = v \tag{2.84}$$

and the vacuum expectation value of $\phi$ is

$$\langle 0|\phi(x)|0\rangle = \begin{pmatrix} 0 \\ \tfrac{1}{\sqrt{2}}\,v \end{pmatrix}. \tag{2.85}$$

We now introduce a new field $h(x)$ by shifting $\rho(x)$ by its vacuum expectation value, i.e.

$$h(x) = \rho(x) - v. \tag{2.86}$$

The vacuum expectation value of $h(x)$ is

$$\langle 0|h(x)|0\rangle = 0 \tag{2.87}$$

which is an immediate consequence of its definition.

The field $h(x)$ is the Higgs field of the Standard Model. The fact that its vacuum expectation value is zero is very significant since a Lagrangian using $h(x)$ is renormalizable. For that reason, we will rewrite the latest Lagrangian, $\mathcal{L}_3$, in terms of the field $h(x)$ and obtain the final form of the electroweak Lagrangian for leptons. In other words, we will use the following substitution

$$\phi(x) \rightarrow \begin{pmatrix} 0 \\ \tfrac{1}{\sqrt{2}}\,(v + h(x)) \end{pmatrix} \tag{2.88}$$

to obtain

$$\phi(x) = \begin{pmatrix} 0 \\ \tfrac{1}{\sqrt{2}}\,(v + h(x)) \end{pmatrix}. \tag{2.89}$$


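After some algebra, we arrive at the final form of the Lagrangian

$$\mathcal{L}^l_{EW} = -\tfrac{1}{4}\mathrm{Tr}\left(W^a_{\mu\nu}W_a^{\mu\nu}\right) - \tfrac{1}{4}B_{\mu\nu}B^{\mu\nu} + \bar{\nu}_{eL}\,i\gamma^\mu\partial_\mu\,\nu_{eL} + \bar{e}\,i\gamma^\mu\partial_\mu\,e$$

$$\qquad + W^+_\mu W^{-\mu}M_W^2\left(1 + \frac{h}{v}\right)^2 + \tfrac{1}{2}Z_\mu Z^\mu M_Z^2\left(1 + \frac{h}{v}\right)^2 - m_e\bar{e}e\left(1 + \frac{h}{v}\right) + \tfrac{1}{2}\partial_\mu h\,\partial^\mu h \tag{2.90}$$

$$\qquad - \tfrac{1}{2}M_H^2h^2\left[1 + \frac{h}{v} + \frac{1}{4}\left(\frac{h}{v}\right)^2\right]$$

$$\qquad - e\left[A_\mu J^\mu_{EM} + \frac{1}{\sqrt{2}\sin\theta_W}\left(W^+_\mu\,\bar{\nu}_{eL}\gamma^\mu e_L + W^-_\mu\,\bar{e}_L\gamma^\mu\nu_{eL}\right) + \frac{1}{\sin\theta_W\cos\theta_W}\,Z_\mu J^\mu_{NC}\right]. \tag{2.91}$$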


In this Lagrangian, the fields $W^+$ and $W^-$ are linear combinations of the first two components of the original $W^a$ gauge bosons that came from $SU(2)_L$:

$$W^\pm_\mu = \tfrac{1}{\sqrt{2}}\left(W^1_\mu \mp i W^2_\mu\right). \tag{2.92}$$

The $Z$ and $A$ fields are linear combinations of the third component of the original $W^a$ gauge bosons and the gauge field that came from $U(1)_Y$:

$$Z_\mu = \frac{1}{\sqrt{g^2 + g'^2}}\left(g\,W^3_\mu - g' B_\mu\right) \tag{2.93}$$

$$A_\mu = \frac{1}{\sqrt{g^2 + g'^2}}\left(g'\,W^3_\mu + g\,B_\mu\right). \tag{2.94}$$

The term $W^+_\mu W^{-\mu}M_W^2$ in the Lagrangian implies that the $W^+$ and $W^-$ fields are massive with mass $M_W$. The term $\tfrac{1}{2}Z_\mu Z^\mu M_Z^2$ implies that the $Z$ field is also massive with mass $M_Z$. On the other hand, there is no mass term for the $A$ field! Therefore, we obtained three massive gauge bosons, $W^+$, $W^-$ and $Z$, and one massless gauge boson, $A$. The massive bosons can be identified as the carriers of the weak interaction and the massless field, $A$, as the photon field of QED.

The masses are expressed in terms of the original parameters as follows:

$$M_W^2 = \frac{g^2v^2}{4} = \frac{e^2v^2}{4\sin^2\theta_W}, \tag{2.95}$$

$$M_Z^2 = \frac{(g^2 + g'^2)\,v^2}{4} = \frac{e^2v^2}{4\sin^2\theta_W\cos^2\theta_W}. \tag{2.96}$$


The angle $\theta_W$ is defined as

$$\sin\theta_W = \frac{g'}{\sqrt{g^2 + g'^2}} \tag{2.97}$$

$$\cos\theta_W = \frac{g}{\sqrt{g^2 + g'^2}} \tag{2.98}$$
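The relations among the couplings, the mixing angle and the boson masses can be checked numerically. The sketch below is an illustration under stated assumptions: the numerical values of `g`, `gp` and `v` are placeholders chosen only to exercise the formulas and are not taken from the text, the $Z$ and $A$ fields are taken to be the orthogonal combinations $(g W^3 - g'B)/\sqrt{g^2+g'^2}$ and $(g'W^3 + gB)/\sqrt{g^2+g'^2}$, and $e = g\sin\theta_W$ is assumed, as implied by equating the two forms of $M_W^2$ in Eq. (2.95).

```python
import math


def weak_mixing(g, gp, v):
    """Mixing angle, electric charge and gauge boson masses from g, g', v."""
    norm = math.hypot(g, gp)
    sin_w = gp / norm            # Eq. (2.97)
    cos_w = g / norm             # Eq. (2.98)
    return {
        "sin_w": sin_w,
        "cos_w": cos_w,
        "e": g * sin_w,          # ASSUMPTION: e = g sin(theta_W)
        "M_W": g * v / 2.0,      # square root of Eq. (2.95)
        "M_Z": norm * v / 2.0,   # square root of Eq. (2.96)
    }


def rotate(W3, B, g, gp):
    """Return (Z, A) as the orthogonal combinations of W3 and B."""
    norm = math.hypot(g, gp)
    Z = (g * W3 - gp * B) / norm
    A = (gp * W3 + g * B) / norm
    return Z, A


# Placeholder inputs (hypothetical, for illustration only).
g, gp, v = 0.65, 0.35, 246.0
r = weak_mixing(g, gp, v)
Z, A = rotate(1.3, -0.4, g, gp)
```

Two consequences follow directly from the definitions and are independent of the placeholder values: $M_W = M_Z\cos\theta_W$, and the change of basis from $(W^3, B)$ to $(Z, A)$ is a rotation, so it preserves the form of the kinetic terms.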


In addition to the fact that three of the bosons “acquired” mass, this procedure also recovered the mass of the electron which is implied by the $-m_e\bar{e}e$ term in the Lagrangian. The Higgs field is also massive due to the $-\tfrac{1}{2}M_H^2h^2$ term. In terms of the original parameters, their masses are given by

$$m_e = c_e\,\frac{v}{\sqrt{2}}, \tag{2.99}$$


and 


(2.100) 


The $J^\mu_{EM}$ and $J^\mu_{NC}$ terms in the Lagrangian denote the electromagnetic and neutral currents respectively. Their explicit form is given by

$$J^\mu_{EM} = -\bar{e}\gamma^\mu e, \tag{2.101}$$

$$J^\mu_{NC} = \tfrac{1}{2}\bar{\nu}_{eL}\gamma^\mu\nu_{eL} - \tfrac{1}{2}\bar{e}_L\gamma^\mu e_L - \sin^2\theta_W\,J^\mu_{EM}. \tag{2.102}$$

As can be seen from the above equations, the electromagnetic current is the correct QED current since it contains equally the left-handed and right-handed components of the electron.


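To include the other leptons in the Lagrangian, we just add the corresponding terms for each of the three generations of leptons given below.

$$\begin{pmatrix} \nu_e \\ e \end{pmatrix},\quad \begin{pmatrix} \nu_\mu \\ \mu \end{pmatrix},\quad \begin{pmatrix} \nu_\tau \\ \tau \end{pmatrix} \tag{2.103}$$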


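The electroweak Lagrangian for all leptons then becomes

$$\mathcal{L}^l_{EW} = -\tfrac{1}{4}\mathrm{Tr}\left(W^a_{\mu\nu}W_a^{\mu\nu}\right) - \tfrac{1}{4}B_{\mu\nu}B^{\mu\nu}$$

$$\qquad + W^+_\mu W^{-\mu}M_W^2\left(1 + \frac{h}{v}\right)^2 + \tfrac{1}{2}Z_\mu Z^\mu M_Z^2\left(1 + \frac{h}{v}\right)^2$$

$$\qquad + \sum_{\ell = e,\mu,\tau}\left[\bar{\nu}_{\ell L}\,i\gamma^\mu\partial_\mu\,\nu_{\ell L} + \bar{\ell}\left(i\gamma^\mu\partial_\mu - m_\ell\left(1 + \frac{h}{v}\right)\right)\ell\right]$$

$$\qquad + \tfrac{1}{2}\partial_\mu h\,\partial^\mu h - \tfrac{1}{2}M_H^2h^2\left[1 + \frac{h}{v} + \frac{1}{4}\left(\frac{h}{v}\right)^2\right]$$

$$\qquad - e\left[A_\mu J^\mu_{EM} + \frac{1}{\sin\theta_W\cos\theta_W}\,Z_\mu J^\mu_{NC} + \frac{1}{\sqrt{2}\sin\theta_W}\left(W^+_\mu J^\mu_{CC} + W^-_\mu J^{\mu\dagger}_{CC}\right)\right] \tag{2.104}$$

where the current terms are given by

$$J^\mu_{EM} = \bar{\psi}\gamma^\mu\left(T_3 + Y\right)\psi \tag{2.105}$$

$$J^\mu_{NC} = \bar{\psi}\gamma^\mu\left(T_3 - \sin^2\theta_W\,(T_3 + Y)\right)\psi \tag{2.106}$$

$$J^\mu_{CC} = \bar{\psi}\gamma^\mu\left(T_1 + iT_2\right)\psi. \tag{2.107}$$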


2.2.6  Extension  to  Quarks 


Since  quarks  are  also  fermions,  their  introduction  in  the  Lagrangian  is  very 
similar  to  that  of  leptons.  Unlike  neutrinos,  however,  the  down  quarks  are  massive 
and,  therefore,  they  have  a right-handed  component  as  well. 

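We consider three generations of quarks

$$\begin{pmatrix} u_1 \\ d_1 \end{pmatrix},\quad \begin{pmatrix} u_2 \\ d_2 \end{pmatrix},\quad \begin{pmatrix} u_3 \\ d_3 \end{pmatrix} \tag{2.108}$$

and organize each generation into one left-handed isospin doublet and two right-handed isospin singlets as follows

$$Q_{iL} = \begin{pmatrix} u_{iL} \\ d_{iL} \end{pmatrix},\quad u_{iR},\quad d_{iR},\qquad (i = 1, 2, 3) \tag{2.109}$$

where the hypercharge of the doublet is $Y = \tfrac{1}{6}$ and the hypercharges of the singlets are $Y = \tfrac{2}{3}$ and $Y = -\tfrac{1}{3}$ respectively. Table 2.2 lists the quantum numbers of all leptons and quarks.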
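The quantum-number assignments can be cross-checked against the charge relation used for the currents, $Q = T_3 + Y$. The sketch below uses exact fractions; the rows are transcribed from Table 2.2 and from the hypercharges quoted above, with one representative per generation, and the names `ROWS` and `charge_mismatches` are hypothetical.

```python
from fractions import Fraction as F

# (field, T_3, Y, Q) for the first generation; the other two generations
# repeat the same pattern. Values are transcribed, not derived.
ROWS = [
    ("nu_eL", F(1, 2), F(-1, 2), F(0)),
    ("e_L", F(-1, 2), F(-1, 2), F(-1)),
    ("e_R", F(0), F(-1), F(-1)),
    ("u_L", F(1, 2), F(1, 6), F(2, 3)),
    ("d_L", F(-1, 2), F(1, 6), F(-1, 3)),
    ("u_R", F(0), F(2, 3), F(2, 3)),
    ("d_R", F(0), F(-1, 3), F(-1, 3)),
]


def charge_mismatches(rows):
    """Return the fields for which Q differs from T_3 + Y."""
    return [name for name, t3, y, q in rows if t3 + y != q]


mismatches = charge_mismatches(ROWS)
doublet_hypercharges_match = ROWS[3][2] == ROWS[4][2]
```

An empty `mismatches` list means every row is consistent with $Q = T_3 + Y$, and the second check confirms that both members of the quark doublet share one hypercharge, as required for $Y$ to commute with the $SU(2)_L$ generators.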

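Let $\mathcal{L}^l_{EW}$ be the electroweak Lagrangian for leptons expressed in terms of the original gauge fields. Then the full Lagrangian would be

$$\mathcal{L}_{EW} = \mathcal{L}^l_{EW} \tag{2.110}$$

$$\qquad + \sum_{i=1}^{3}\left(\bar{Q}_{iL}\,i\gamma^\mu D_\mu\,Q_{iL} + \bar{u}_{iR}\,i\gamma^\mu D_\mu\,u_{iR} + \bar{d}_{iR}\,i\gamma^\mu D_\mu\,d_{iR}\right) \tag{2.111}$$

$$\qquad + \sum_{i=1}^{3}\sum_{j=1}^{3}\left(c^u_{ij}\,\bar{Q}_{iL}\,\tilde{\phi}\,u_{jR} + c^d_{ij}\,\bar{Q}_{iL}\,\phi\,d_{jR}\right) + \text{h.c.} \tag{2.112}$$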


Table 2.2: Flavor quantum numbers of leptons and quarks.

                                   T      T_3      Y       Q
    nu_eL   nu_muL   nu_tauL      1/2     1/2    -1/2      0
    e_L     mu_L     tau_L        1/2    -1/2    -1/2     -1
    e_R     mu_R     tau_R         0       0      -1      -1
    u_L     c_L      t_L          1/2     1/2     1/6     2/3
    d_L     s_L      b_L          1/2    -1/2     1/6    -1/3
    u_R     c_R      t_R           0       0      2/3     2/3
    d_R     s_R      b_R           0       0     -1/3    -1/3

where

$$\tilde{\phi} = i\tau_2\phi^*. \tag{2.113}$$

The terms in 2.111 are the free-field massless quark terms. The terms in 2.112 are the Yukawa couplings of quarks to the $\phi$ field. They will make the quarks massive.

The coupling constants, $c_{ij}$, are complex and they generally do not form a hermitian matrix. The fact that $c_{ij}$ is not diagonal implies that the quark fields appearing in $\mathcal{L}_{EW}$ are not the physical fields. The physical quark fields of well defined mass are linear combinations of the $u_i$ and $d_i$ fields, which can be obtained by diagonalizing $c_{ij}$.


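As was the case with leptons, we choose the unitary gauge and rewrite the Lagrangian in terms of the physical Higgs field, $h$:

$$\mathcal{L}_{EW} = \mathcal{L}^l_{EW} \tag{2.114}$$

$$\qquad + \sum_{i=1}^{3}\left(\bar{Q}_{iL}\,i\gamma^\mu D_\mu\,Q_{iL} + \bar{u}_{iR}\,i\gamma^\mu D_\mu\,u_{iR} + \bar{d}_{iR}\,i\gamma^\mu D_\mu\,d_{iR}\right) \tag{2.115}$$

$$\qquad + \frac{v + h}{\sqrt{2}}\sum_{i=1}^{3}\sum_{j=1}^{3}\left(c^u_{ij}\,\bar{u}_{iL}u_{jR} + c^d_{ij}\,\bar{d}_{iL}d_{jR}\right) + \text{h.c.} \tag{2.116}$$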


We now define the two mass matrices corresponding to the up quarks, $M^u$, and the down quarks, $M^d$, as follows


(2.117) 

(2.118) 


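and rewrite the Lagrangian as

$$\mathcal{L}_{EW} = \mathcal{L}^l_{EW} \tag{2.119}$$

$$\qquad + \sum_{i=1}^{3}\left(\bar{Q}_{iL}\,i\gamma^\mu D_\mu\,Q_{iL} + \bar{u}_{iR}\,i\gamma^\mu D_\mu\,u_{iR} + \bar{d}_{iR}\,i\gamma^\mu D_\mu\,d_{iR}\right) \tag{2.120}$$

$$\qquad - \left(1 + \frac{h}{v}\right)\sum_{i=1}^{3}\sum_{j=1}^{3}\left(M^u_{ij}\,\bar{u}_{iL}u_{jR} + M^d_{ij}\,\bar{d}_{iL}d_{jR}\right) + \text{h.c.} \tag{2.121}$$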


To identify the physical quark fields, we must diagonalize the two mass matrices. It can be shown that any complex matrix, $M$, can be diagonalized by multiplying it on the left and right with two different unitary matrices. In our case, we will diagonalize $M^u$ and $M^d$ as follows

$$U_L^\dagger M^u U_R = \begin{pmatrix} m_u & 0 & 0 \\ 0 & m_c & 0 \\ 0 & 0 & m_t \end{pmatrix} \tag{2.122}$$

and

$$D_L^\dagger M^d D_R = \begin{pmatrix} m_d & 0 & 0 \\ 0 & m_s & 0 \\ 0 & 0 & m_b \end{pmatrix} \tag{2.123}$$

where $U_L$, $U_R$, $D_L$ and $D_R$ are unitary matrices and $m_u$, $m_c$, $m_t$, $m_d$, $m_s$ and $m_b$ are the physical masses of the quarks.
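The bi-unitary diagonalization in Eqs. (2.122) and (2.123) is what a singular value decomposition provides. The sketch below illustrates this with NumPy on a randomly generated complex $3 \times 3$ matrix; the matrix is a stand-in with no physical meaning, and the function name `biunitary_diagonalize` is hypothetical.

```python
import numpy as np


def biunitary_diagonalize(M):
    """Return (U_L, masses, U_R) with U_L^dagger M U_R = diag(masses).

    np.linalg.svd gives M = W diag(s) Vh, hence W^dagger M Vh^dagger is
    diagonal with real, non-negative entries. Columns are reordered so
    that the masses come out in ascending order.
    """
    W, s, Vh = np.linalg.svd(M)
    order = np.argsort(s)
    U_L = W[:, order]
    U_R = Vh.conj().T[:, order]
    return U_L, s[order], U_R


# Stand-in for a generic, non-hermitian mass matrix (illustrative only).
rng = np.random.default_rng(0)
M_u = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))

U_L, masses, U_R = biunitary_diagonalize(M_u)
diagonal = U_L.conj().T @ M_u @ U_R
```

Two different unitary matrices are needed because a generic complex matrix is not normal; a single unitary similarity transformation would only suffice for a hermitian mass matrix.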

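These transformations imply that in order to diagonalize the mass matrices we must change the original quark fields $u_i$ and $d_i$ as follows

$$\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix}_{L,R} = U_{L,R}\begin{pmatrix} u \\ c \\ t \end{pmatrix}_{L,R},\qquad \begin{pmatrix} d_1 \\ d_2 \\ d_3 \end{pmatrix}_{L,R} = D_{L,R}\begin{pmatrix} d \\ s \\ b \end{pmatrix}_{L,R} \tag{2.124}$$

where $(u, c, t)$ and $(d, s, b)$ are the physical quark fields. Using these transformations, we can rewrite the Yukawa terms as follows: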


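$$\mathcal{L}_{EW} = \mathcal{L}^l_{EW} \tag{2.125}$$

$$\qquad + \sum_{i=1}^{3}\left(\bar{Q}_{iL}\,i\gamma^\mu D_\mu\,Q_{iL} + \bar{u}_{iR}\,i\gamma^\mu D_\mu\,u_{iR} + \bar{d}_{iR}\,i\gamma^\mu D_\mu\,d_{iR}\right) \tag{2.126}$$

$$\qquad - \left(1 + \frac{h}{v}\right)\sum_{q = u,d,c,s,t,b} m_q\,\bar{q}q. \tag{2.127}$$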


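We must also apply the same transformations to the free quark terms. Due to the quark isospin doublet and the covariant derivative, there are terms like $\bar{u}_{iL}\gamma^\mu d_{iL}$, for example, which transform as

$$(\bar{u}_1, \bar{u}_2, \bar{u}_3)_L\,\gamma^\mu\begin{pmatrix} d_1 \\ d_2 \\ d_3 \end{pmatrix}_L = (\bar{u}, \bar{c}, \bar{t})_L\,U_L^\dagger D_L\,\gamma^\mu\begin{pmatrix} d \\ s \\ b \end{pmatrix}_L. \tag{2.128}$$

Therefore, there is mixing between generations since the matrix

$$V = U_L^\dagger D_L \tag{2.129}$$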


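is not the identity matrix. At the same time, this mixing can only occur between up and down quarks, but not up and up quarks. For example, there is no mixing in

$$(\bar{u}_1, \bar{u}_2, \bar{u}_3)_L\,\gamma^\mu\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix}_L = (\bar{u}, \bar{c}, \bar{t})_L\,U_L^\dagger U_L\,\gamma^\mu\begin{pmatrix} u \\ c \\ t \end{pmatrix}_L + \text{h.c.} \tag{2.130}$$

since $U_L$ is unitary.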
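The contrast between Eqs. (2.128) and (2.130) can be illustrated numerically. In the sketch below, two unrelated random complex matrices stand in for $M^u$ and $M^d$ (they carry no physical meaning), their left rotations are extracted with a singular value decomposition, and the two products $U_L^\dagger D_L$ and $U_L^\dagger U_L$ are compared. The function name `left_rotation` is hypothetical.

```python
import numpy as np


def left_rotation(M):
    """Left unitary matrix of the bi-unitary diagonalization of M."""
    W, _, _ = np.linalg.svd(M)
    return W


# Illustrative stand-ins for the up- and down-type mass matrices.
rng = np.random.default_rng(1)
M_u = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
M_d = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))

U_L = left_rotation(M_u)
D_L = left_rotation(M_d)

V = U_L.conj().T @ D_L            # charged-current mixing, Eq. (2.129)
same_type = U_L.conj().T @ U_L    # combination appearing in Eq. (2.130)
```

The matrix `V` is unitary but generically different from the identity because the up-type and down-type mass matrices are diagonalized by different rotations, whereas `same_type` is the identity for any choice of `M_u`.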

The matrix $V$ is called the Cabibbo-Kobayashi-Maskawa (CKM) matrix and its individual components describe the relative strength of mixing between various quarks. It can be shown that the CKM matrix can be parametrized in terms of only 4 parameters as follows

$$V = R_2(-\theta_2)\,R_1(-\theta_1)\,D(\delta - \pi)\,R_2(\theta_3) \tag{2.131}$$


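where

$$R_1(\theta_i) = \begin{pmatrix} c_i & s_i & 0 \\ -s_i & c_i & 0 \\ 0 & 0 & 1 \end{pmatrix},\quad R_2(\theta_i) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & c_i & s_i \\ 0 & -s_i & c_i \end{pmatrix},\quad D(\delta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & e^{i\delta} \end{pmatrix} \tag{2.132}$$

and $c_i = \cos\theta_i$, $s_i = \sin\theta_i$. It is customary to write the CKM matrix in the following form:

$$V = \begin{pmatrix} c_1 & -s_1c_3 & -s_1s_3 \\ s_1c_2 & c_1c_2c_3 - s_2s_3e^{i\delta} & c_1c_2s_3 + s_2c_3e^{i\delta} \\ s_1s_2 & c_1s_2c_3 + c_2s_3e^{i\delta} & c_1s_2s_3 - c_2c_3e^{i\delta} \end{pmatrix} \tag{2.133}$$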

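By convention, the mixing effect is only given to the $T_3 = -\tfrac{1}{2}$ states, i.e.

$$\begin{pmatrix} d' \\ s' \\ b' \end{pmatrix} = V\begin{pmatrix} d \\ s \\ b \end{pmatrix}. \tag{2.134}$$

Therefore, the quark weak eigenstates, which are not mass eigenstates, are:

$$\begin{pmatrix} u \\ d' \end{pmatrix}_L,\ \begin{pmatrix} c \\ s' \end{pmatrix}_L,\ \begin{pmatrix} t \\ b' \end{pmatrix}_L,\ u_R,\ c_R,\ t_R,\ d_R,\ s_R,\ b_R. \tag{2.135}$$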


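We now rewrite the free quark terms in terms of the physical quark fields and obtain the final form of the electroweak Lagrangian:

$$\mathcal{L}_{EW} = -\tfrac{1}{4}\mathrm{Tr}\left(W^a_{\mu\nu}W_a^{\mu\nu}\right) - \tfrac{1}{4}B_{\mu\nu}B^{\mu\nu}$$

$$\qquad + W^+_\mu W^{-\mu}M_W^2\left(1 + \frac{h}{v}\right)^2 + \tfrac{1}{2}Z_\mu Z^\mu M_Z^2\left(1 + \frac{h}{v}\right)^2$$

$$\qquad + \sum_{\ell = e,\mu,\tau}\left[\bar{\nu}_{\ell L}\,i\gamma^\mu\partial_\mu\,\nu_{\ell L} + \bar{\ell}\left(i\gamma^\mu\partial_\mu - m_\ell\left(1 + \frac{h}{v}\right)\right)\ell\right]$$

$$\qquad + \sum_{q = u,d,c,s,t,b}\bar{q}\left(i\gamma^\mu\partial_\mu - m_q\left(1 + \frac{h}{v}\right)\right)q$$

$$\qquad + \tfrac{1}{2}\partial_\mu h\,\partial^\mu h - \tfrac{1}{2}M_H^2h^2\left[1 + \frac{h}{v} + \frac{1}{4}\left(\frac{h}{v}\right)^2\right]$$

$$\qquad - e\left[A_\mu J^\mu_{EM} + \frac{1}{\sin\theta_W\cos\theta_W}\,Z_\mu J^\mu_{NC} + \frac{1}{\sqrt{2}\sin\theta_W}\left(W^+_\mu J^\mu_{CC} + W^-_\mu J^{\mu\dagger}_{CC}\right)\right]. \tag{2.136}$$

The electromagnetic current, $J^\mu_{EM}$, is given by

$$J^\mu_{EM} = -\sum_{\ell = e,\mu,\tau}\bar{\ell}\gamma^\mu\ell + \sum_{q = u,d,c,s,t,b}Q_q\,\bar{q}\gamma^\mu q \tag{2.137}$$

where $Q_q$ are the quark charges. The neutral current, $J^\mu_{NC}$, is given by

$$J^\mu_{NC} = (\bar{\nu}_e, \bar{\nu}_\mu, \bar{\nu}_\tau)\,\gamma^\mu\,\tfrac{1}{2}P_L\begin{pmatrix} \nu_e \\ \nu_\mu \\ \nu_\tau \end{pmatrix} + (\bar{e}, \bar{\mu}, \bar{\tau})\,\gamma^\mu\left(-\tfrac{1}{2}P_L + \sin^2\theta_W\right)\begin{pmatrix} e \\ \mu \\ \tau \end{pmatrix}$$

$$\qquad + (\bar{u}, \bar{c}, \bar{t})\,\gamma^\mu\left(\tfrac{1}{2}P_L - \tfrac{2}{3}\sin^2\theta_W\right)\begin{pmatrix} u \\ c \\ t \end{pmatrix} + (\bar{d}, \bar{s}, \bar{b})\,\gamma^\mu\left(-\tfrac{1}{2}P_L + \tfrac{1}{3}\sin^2\theta_W\right)\begin{pmatrix} d \\ s \\ b \end{pmatrix}$$

where $P_L = \tfrac{1}{2}(1 - \gamma_5)$. The charged current, $J^\mu_{CC}$, is given by

$$J^\mu_{CC} = (\bar{\nu}_e, \bar{\nu}_\mu, \bar{\nu}_\tau)_L\,\gamma^\mu\begin{pmatrix} e \\ \mu \\ \tau \end{pmatrix}_L + (\bar{u}, \bar{c}, \bar{t})_L\,\gamma^\mu\,V\begin{pmatrix} d \\ s \\ b \end{pmatrix}_L. \tag{2.138}$$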


As can be seen from the equations above, only the charged current contains the CKM matrix, $V$, and, therefore, it is the only flavor changing current in the Lagrangian. For example, the $V_{11}$ CKM matrix element allows the decay $d \rightarrow u + W^-$ which is responsible for the decay of the neutron. On the other hand, the absence of the CKM matrix from the neutral current prohibits, for example, the process $s \rightarrow d + Z$.

The electroweak theory successfully describes all experimentally observed processes involving the weak and electromagnetic interactions. In addition, even if some day it is shown that the neutrinos have mass, the Lagrangian is easily extendible to that case as well since the procedure would be the same as that for quarks. It would involve one more mass diagonalization and the introduction of a second CKM matrix.

In  the  next  section,  we  will  extend  the  electroweak  theory  to  include  the  strong 
nuclear  interactions  as  well. 

2.3  Quantum  Chromodynamics 

Quantum chromodynamics (QCD) describes the strong nuclear interactions. They are carried by 8 gauge particles called gluons. Since gluons are massless, we can proceed exactly as we did in QED: choose a gauge group, extend the covariant derivative and add the free terms for the gauge fields to make them dynamical variables.

The simplest group with 8 generators is the $SU(3)$ group. We will call it the color $SU(3)_c$ gauge group of strong interactions. The generators of this group are $\tfrac{1}{2}\lambda_a$ where $\lambda_a$ are the $3 \times 3$ Gell-Mann matrices. The gauge transformations are given by

Uc{x)  = (2.139) 

We now extend the Lagrangian of the electroweak interactions, $\mathcal{L}_{EW}$, to the full Lagrangian of the Standard Model, $\mathcal{L}_{SM}$, as follows

$$\mathcal{L}_{SM} = \mathcal{L}_{EW} - \tfrac{1}{4}\mathrm{Tr}\left(G^a_{\mu\nu}G_a^{\mu\nu}\right) \tag{2.140}$$

where

$$G^a_{\mu\nu}(x) = \partial_\mu G^a_\nu(x) - \partial_\nu G^a_\mu(x) - g_s f_{abc}\,G^b_\mu(x)\,G^c_\nu(x), \tag{2.141}$$

the constant $g_s$ is a new parameter of the theory specifying the strength of the strong interaction, the $f_{abc}$ are the structure constants of $SU(3)_c$ and the covariant derivative in $\mathcal{L}_{SM}$ has been extended to

$$D_\mu = \partial_\mu + i g_s G^a_\mu F_a + i g W^a_\mu T_a + i g' B_\mu Y \tag{2.142}$$

where $F_a$ is the color representation of the corresponding field and $G^a_\mu$ are the gluon gauge fields.

The  full  gauge  symmetry  of  Csm  is  SU{3)c  x SU{2)l  x U{1)y-  The  Y,  Tq 
and  Fo  matrices  form  a representation  of  the  generators  of  the  group  SU{3)c  x 
SU{2)i  X U{1)y-  They  obey  the  following  commutation  rules: 


[Fo,  Fft]  = ifabcFc 

(2.143) 

[To,  Tft]  = iCabcTc 

(2.144) 

[To,F6]=0 

(2.145) 

[To,Y]  = 0 

(2.146) 

[Fo,Y]  = 0 

(2.147) 
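
The commutation rule for the color generators can be checked numerically. The following is a minimal sketch, assuming the standard textbook form of the Gell-Mann matrices and the triplet representation $F_a = \lambda_a/2$; the function names `commutator` and `structure_constants` are illustrative and do not come from the text.

```python
import numpy as np

# The eight 3x3 Gell-Mann matrices in the standard textbook convention
# (an assumption; the text does not list them explicitly).
S3 = 1.0 / np.sqrt(3.0)
GELL_MANN = np.array([
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
    [[0, -1j, 0], [1j, 0, 0], [0, 0, 0]],
    [[1, 0, 0], [0, -1, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0], [1, 0, 0]],
    [[0, 0, -1j], [0, 0, 0], [1j, 0, 0]],
    [[0, 0, 0], [0, 0, 1], [0, 1, 0]],
    [[0, 0, 0], [0, 0, -1j], [0, 1j, 0]],
    [[S3, 0, 0], [0, S3, 0], [0, 0, -2 * S3]],
], dtype=complex)

# Color generators in the triplet (quark) representation.
F = GELL_MANN / 2.0


def commutator(a, b):
    """Matrix commutator [a, b]."""
    return a @ b - b @ a


def structure_constants(gens):
    """Extract f_abc from Tr([F_a, F_b] F_c) = (i/2) f_abc.

    This relation holds when the generators are normalized so that
    Tr(F_a F_b) = delta_ab / 2, which is true for F_a = lambda_a / 2.
    """
    n = len(gens)
    f = np.zeros((n, n, n))
    for a in range(n):
        for b in range(n):
            comm = commutator(gens[a], gens[b])
            for c in range(n):
                f[a, b, c] = (-2j * np.trace(comm @ gens[c])).real
    return f


f_abc = structure_constants(F)
```

With zero-based indices, `f_abc[0, 1, 2]` corresponds to $f_{123}$, and the array can be used to confirm that $[F_a, F_b] = i f_{abc} F_c$ holds for every pair of generators.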

All quark fields in $\mathcal{L}_{SM}$ are color triplets and all lepton fields are color singlets
since they do not participate in strong interactions. In addition, the field $\phi$ is also
a color singlet. As a consequence, the spontaneous symmetry breaking in the electroweak
sector is exactly the same as before and breaks the full symmetry as follows:

$$SU(3)_C \times SU(2)_L \times U(1)_Y \to SU(3)_C \times U(1)_{QED}. \tag{2.148}$$


Taking  into  account  that  only  quarks  are  affected  by  the  gluon  gauge  fields  G“ 
in  the  extended  derivative,  we  arrive  at  the  final  form  of  the  Lagrangian  for  the 
Standard  Model: 


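$$
\begin{aligned}
\mathcal{L}_{SM} ={}& -\frac{1}{4}\mathrm{Tr}\left(G_{\mu\nu}G^{\mu\nu}\right) - \frac{1}{4}\mathrm{Tr}\left(W_{\mu\nu}W^{\mu\nu}\right) - \frac{1}{4}B_{\mu\nu}B^{\mu\nu} \\
& + W^+_\mu W^{-\mu} M_W^2 \left(1 + \frac{h}{v}\right)^2 + \frac{1}{2} Z_\mu Z^\mu M_Z^2 \left(1 + \frac{h}{v}\right)^2 \\
& + \sum_{i=e,\mu,\tau} \left( \bar{\nu}_{iL}\, i\gamma^\mu \partial_\mu\, \nu_{iL} + \bar{\ell}_i \left[ i\gamma^\mu \partial_\mu - m_i \left(1 + \frac{h}{v}\right) \right] \ell_i \right) \\
& + \sum_{q=u,d,c,s,t,b} \bar{q} \left[ i\gamma^\mu \left( \partial_\mu + i g_s \frac{\lambda_a}{2} G^a_\mu \right) - m_q \left(1 + \frac{h}{v}\right) \right] q \\
& + \frac{1}{2} \partial_\mu h\, \partial^\mu h - \frac{1}{2} M_h^2 h^2 \left( 1 + \frac{h}{v} + \frac{1}{4}\left(\frac{h}{v}\right)^2 \right) \\
& - e J^\mu_{em} A_\mu - \frac{e}{\sin\theta_W \cos\theta_W} J^\mu_{nc} Z_\mu - \frac{e}{\sqrt{2}\sin\theta_W} \left( W^+_\mu J^{\mu}_{cc} + W^-_\mu J^{\mu\dagger}_{cc} \right)
\end{aligned}
\tag{2.149}
$$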


This Lagrangian contains 18 parameters^1 and they are as follows:

3 coupling constants: $g_s$, $e$, $\sin\theta_W$
2 boson masses: $M_W$, $M_h$
3 lepton masses: $m_e$, $m_\mu$, $m_\tau$
6 quark masses: $m_u$, $m_d$, $m_c$, $m_s$, $m_t$, $m_b$
4 CKM parameters: $\theta_1$, $\theta_2$, $\theta_3$, $\delta$.

^1 Strictly speaking, there are 19 parameters since the Lagrangian of QCD should also contain
a term proportional to the parameter $\theta$, which does not contribute in perturbation theory.
If the parameter $\theta$ is zero, however, some of the predictions of QCD do not agree with
experiment.

Despite  the  large  number  of  parameters,  the  Standard  Model  is,  nevertheless, 
the  most  successful  theory  constructed  so  far.  Its  predictions  are  consistent  with 
all  experimental  data.  It  is  very  important,  however,  that  the  Higgs  particle  be 
discovered  in  order  to  fully  validate  this  theory. 


CHAPTER  3 

DATA  ANALYSIS  IN  HIGH  ENERGY  PHYSICS 


Modern high energy physics experiments are performed at particle accelerators.
In this work, we are specifically interested in hadron-hadron colliders where protons
and anti-protons collide with each other (see Fig. 3.1) at high energy (about
1 TeV). Most collisions result in events that are understood in terms of QCD.
They are a direct manifestation of the parton (or quark) structure of hadrons
and the increase of the strong force with distance. In other words, the quarks
within a hadron are essentially free particles until they interact with a quark in
another hadron. In that case, the strong force between the separating quarks
increases until a new quark-antiquark pair is created. At high collision energies
this results in the continuous creation of quark-antiquark pairs along the direction
of the momentum of the original quark. The produced quarks eventually recombine
into various particles and are deposited in the detector as a burst of localized
energy which is called a jet. These events constitute the “ordinary” QCD background
and, since they are predominant, it is a challenging task to isolate the relatively
few collision events that correspond to some new physics.

Figure 3.1: Illustrates the collision of a proton and anti-proton which results in the
production of new particles. These particles can then decay into other particles.

Hadron collider events can be very complicated due to both the large number
of produced particles and the multiple decays they can undergo. In addition,
there  are  so  many  variables  that  describe  a high  energy  collider  event  that  it  is 
not  obvious  which  variables  best  isolate  the  signal  from  the  background.  A signal 
event  is  any  event  that  corresponds  to  the  production  of  particles  we  are  interested 
in.  For  example,  in  Chapters  7 and  8 we  will  be  interested  in  the  production  of  top 
quark  pairs.  In  that  case,  all  collisions  that  produce  a top  and  anti-top  particle 
will  be  signal  events.  Background  events  are  all  other  events. 

Data analysis in high energy physics aims to isolate specific events from the
background. In general, one has to represent the raw data in a form suitable for
analysis and then apply certain transformations and selection criteria which lead to
an enhancement of the number of signal events relative to the background events.
The three most important measures related to the isolation of signal events are as
follows: First, the enhancement, $F_{enh}$, is defined as

$$F_{enh} = \frac{\%\ \text{of surviving signal events}}{\%\ \text{of surviving background events}}$$

where “surviving events” are those that “survived” the isolation criteria. Enhancement
is a measure of the relative improvement of signal events over background
events. A value of infinity would indicate that all background events were removed
by the selection criteria.

The second measure is the efficiency, $F_{eff}$, defined as

$$F_{eff} = \%\ \text{of surviving signal events}$$


which  measures  how  many  signal  events  are  left  after  applying  the  selection  criteria. 
A value  of  100%  would  indicate  that  the  selection  criteria  did  not  remove  any  of 
the  signal  events. 
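
The two measures defined so far can be computed directly from event labels. The following is a minimal sketch, assuming each event carries a true signal/background label (as in simulated data) and a flag that records whether it passed the selection criteria; the function name `enhancement_and_efficiency` and its arguments are illustrative.

```python
import numpy as np


def enhancement_and_efficiency(is_signal, passed):
    """Return (F_enh, F_eff) for a set of labeled events.

    is_signal: True for signal events, False for background events.
    passed:    True for events that survive the selection criteria.
    F_eff is returned in percent. F_enh is the ratio of the surviving
    signal fraction to the surviving background fraction, and is
    infinite when no background event survives.
    """
    is_signal = np.asarray(is_signal, dtype=bool)
    passed = np.asarray(passed, dtype=bool)
    signal_fraction = passed[is_signal].mean()
    background_fraction = passed[~is_signal].mean()
    if background_fraction == 0:
        f_enh = np.inf
    else:
        f_enh = signal_fraction / background_fraction
    return f_enh, 100.0 * signal_fraction


# Toy sample: 10 signal events (8 survive), 100 background events (10 survive).
labels = [True] * 10 + [False] * 100
survived = [True] * 8 + [False] * 2 + [True] * 10 + [False] * 90
f_enh, f_eff = enhancement_and_efficiency(labels, survived)
```

In this toy sample 80% of the signal and 10% of the background survive, so the enhancement is about 8 and the efficiency is about 80%.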

The third measure is the statistical significance measure, $R_f$, defined as


where $N_{sig}$ is the number of surviving signal events and $N_{bck}$ is the number of
surviving background events. This measure takes into account the fact that the
objective is to have both high enhancement and high efficiency. It can be expressed
in terms of $F_{enh}$ and $F_{eff}$ as follows:


We can summarize the data analysis as shown in Fig. 3.2. The original
(raw) data contains few signal events and many background events and, thus, $R_f$ is
small. The data is transformed one or more times in order to have a representation
suitable for analysis. The transformations do not change the number of signal and
background events. We then apply a set of selection criteria which remove many
background events and few signal events. As a result the enhancement increases,
the efficiency decreases, but $R_f$ increases as well. The more successful the analysis
is, the bigger the increase of $R_f$ will be.

Figure 3.2: Schematic representation of data analysis. The raw data has very few
signal events and many background events. The value of $R_f$ is small. The data
is transformed to obtain a representation that is easy to analyze. When the selection
criteria are applied, a large portion of the background is removed and $R_f$ increases.

3.1  Data  Representation  and  Transformation 
3.1.1  Events  Characterization 

Each  collider  event  is,  in  principle,  completely  specified  by  the  energy  and 
momentum  of  all  particles  produced  in  the  proton-antiproton  collision.  Certain 
particles  (such  as  neutrinos),  however,  cannot  be  detected  at  all.  Other  particles, 
such  as  quarks  and  gluons,  produce  jets.  Muons,  on  the  other  hand,  can  be  detected 
very  accurately. 

Let us consider the schematic representation of a collider event as shown in
Fig. 3.3. The $z$ axis is the direction of the proton-antiproton beam. The two
hadrons collide and the two particles (or jets) fly away from the interaction point.
Their direction is completely specified by the azimuthal angle $\phi$ and the polar angle
$\theta$. In addition, each particle has energy $E$, momentum $P$ and invariant mass $m$
satisfying the relativistic relation

$$E^2 = P^2 + m^2$$

in units where the speed of light is 1.

Figure 3.3: Schematic representation of a collision event. The $z$ axis is the direction
of the two hadron beams. The direction of each particle is expressed in terms of
the azimuthal angle $\phi$ and the polar angle $\theta$. The polar angle is always with respect
to the actual location of interaction between the two hadrons.

The momentum of each particle can be decomposed into the following two
components

$$P^2 = P_z^2 + P_T^2 \tag{3.5}$$

where $P_z$ is the $z$ component of the momentum and $P_T$ is the transverse momentum
of the particle which can also be written as

$$P_T = P\sin\theta$$

where $P$ is the magnitude of the momentum. This representation is very useful
since the longitudinal component of the momentum, $P_z$, cannot always be measured
experimentally.

Another useful quantity is the transverse energy defined as

$$E_T = E\sin\theta.$$

If the magnitude of the momentum, $P$, is inferred experimentally from $E$, then
$E_T = P_T$.
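
The kinematic relations above translate into a few lines of code. This is a minimal sketch in units where the speed of light is 1; the function names are illustrative, and the expression $P_z = P\cos\theta$ is not written out in the text but follows from $P^2 = P_z^2 + P_T^2$ together with $P_T = P\sin\theta$.

```python
import math


def transverse_kinematics(energy, momentum, theta):
    """Return (P_z, P_T, E_T) for a particle with polar angle theta in radians.

    energy and momentum are the particle energy E and the magnitude of its
    momentum P, both in the same units (for example GeV).
    """
    p_z = momentum * math.cos(theta)  # longitudinal component (derived, see lead-in)
    p_t = momentum * math.sin(theta)  # transverse momentum
    e_t = energy * math.sin(theta)    # transverse energy
    return p_z, p_t, e_t


def invariant_mass(energy, momentum):
    """Invariant mass from E^2 = P^2 + m^2, clipped at zero against round-off."""
    return math.sqrt(max(energy ** 2 - momentum ** 2, 0.0))


# Example: a particle with E = 5 GeV and P = 4 GeV emitted at 90 degrees.
p_z, p_t, e_t = transverse_kinematics(5.0, 4.0, math.pi / 2.0)
mass = invariant_mass(5.0, 4.0)
```

For a massless particle (or when $P$ is inferred from $E$) the two inputs coincide, and the function returns $E_T = P_T$ as stated in the text.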

The  cylinder  in  Fig.  3.3  represents  a detector.  Typically,  a detector  has  three 
layers: 

1. The inner layer is a tracking chamber for tracking charged particles. This
allows, among other things, the reconstruction of the actual point of
interaction. The latter is necessary since collisions do not occur at the same
location.

2. The next layer is a calorimeter which measures the energy deposited by jets
and certain particles. The location of the cells determines the direction of
the jet (or the particle).

3. The outer layer is usually a muon tracking device since muons normally
escape the calorimeter. Muon detection is very accurate and provides valuable
information about the event.

As can be seen from Fig. 3.3, the detector does not cover all possible values for $\theta$.
Typically the range of $\theta$ is $2^\circ < \theta < 178^\circ$.

A very useful quantity for specifying the polar location is the pseudorapidity,
$\eta$, defined as

$$\eta = -\ln\tan\left(\frac{\theta}{2}\right). \tag{3.9}$$
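
As a quick numerical illustration of Eq. 3.9, the sketch below converts between the polar angle and the pseudorapidity. The function names `pseudorapidity` and `polar_angle` are illustrative; the inverse relation $\theta = 2\arctan(e^{-\eta})$ follows directly from the definition.

```python
import math


def pseudorapidity(theta):
    """Pseudorapidity eta = -ln tan(theta / 2), with theta in radians (0 < theta < pi)."""
    return -math.log(math.tan(theta / 2.0))


def polar_angle(eta):
    """Inverse of Eq. 3.9: theta = 2 * arctan(exp(-eta)), in radians."""
    return 2.0 * math.atan(math.exp(-eta))


# A detector edge at a polar angle of 2 degrees corresponds to eta of roughly 4.
eta_edge = pseudorapidity(math.radians(2.0))
```

The direction perpendicular to the beam ($\theta = 90^\circ$) maps to $\eta = 0$, and angles mirrored about that direction give pseudorapidities of equal magnitude and opposite sign.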


The direction of each particle (or jet) can then be expressed in terms of $\eta$ and $\phi$.
For example, Fig. 3.4 shows an event in the $\eta$-$\phi$ space. The size of each cell is
$\Delta\eta\,\Delta\phi = 0.1 \times 7.5^\circ$ and $-4 < \eta < 4$, which corresponds to polar angles
$2^\circ < \theta < 178^\circ$. The polar location $\eta = 0$ always corresponds to the actual point of
interaction between the two hadrons. This implies that experimental data is adjusted
for each event so that the interaction point is centered.

Figure 3.4: Representation of an event using the $\eta$-$\phi$ space. The height of each cell
represents the energy deposited in the calorimeter at that location. The straight
line is a muon detected outside the calorimeter. The height of the line represents
the energy of the muon.

The height of each cell in Fig. 3.4 represents the energy deposited in the
calorimeter at this location. The straight line represents a muon and its height
represents its energy.

If we “unwrap” the event shown in Fig. 3.4 with respect to the azimuthal
angle $\phi$, we can represent the same event in the $\eta$-$\phi$ plane as shown in Fig. 3.5.

Figure 3.5: Representation of the event shown in Fig. 3.4 using the $\eta$-$\phi$ plane, i.e.
“unwrapping” the cylinder.

In summary, a complete specification of a collider event involves the
specification of the energy deposited in each cell in the $\eta$-$\phi$ space as well as leptonic
energy detected outside the calorimeter. The point of interaction always corresponds to
$\eta = 0$ and thus it has to be adjusted for each specific interaction. In addition, the
information from the tracking chamber inside the calorimeter can also be part of
the event specification. However, since we are not using such information in the
analysis presented in later chapters, it would be beyond the scope of this work to
discuss such issues.

3.1.2  Jets 

The  specification  of  a collider  event  as  the  set  of  energies  deposited  in  each 
cell  is  quite  impractical  for  analysis  purposes.  Most  cells  have  little  or  no  energy 
deposited  in  them  and  only  a small  number  of  the  cells  are  actually  needed  to 
characterize  the  event  “almost  completely”.  For  that  reason,  the  data  is  usually 
preprocessed  to  isolate  the  few  cell  clusters  that  have  substantial  energy  deposited 
in  them.  These  clusters  are  called  jets.  To  define  a jet,  one  must  specify  two 
parameters: 

1. One must define the radius of a jet in the $\eta$-$\phi$ space.

2. One must require a minimum transverse energy, $E_T^{min}$, for a cluster to qualify
as a jet.

Unfortunately, these two parameters are fairly arbitrary and can result in loss
of information about the event. For example, the number of jets in an event,
also called multiplicity, is an important discriminating factor between signal and
background. Yet, the multiplicity depends on both the jet radius and $E_T^{min}$. As a
consequence, for example, should the jet radius be “too small”, background events
with a high number of jets would look like signal events since a small jet radius
implies a smaller multiplicity.

In  Chapters  6,  7 and  8 we  will  use  jets  in  some  of  the  analysis.  In  those  cases, 
however,  we  will  always  use  at  least  two  different  values  of  the  jet  radius  in  order 
to  diminish  the  negative  effect  of  the  arbitrariness  inherent  in  the  definition  of  a 
jet. 


3.1.3  Event  Topology 

There  is  a very  powerful  way  to  avoid  the  definition  of  a jet  and  yet  specify  an 
event  with  a small  number  of  parameters.  The  basic  idea  is  to  use  the  shape  of 
the  entire  event  and  to  expand  it  in  terms  of  a complete  set  of  suitable  functions. 
The coefficients of this expansion then represent the event as a whole.


Fox-Wolfram Coefficients


In 1979, Geoffrey Fox and Stephen Wolfram [4, 5, 6] constructed a complete
set of rotationally invariant observables, $H_\ell$, which can be used to characterize the
“shapes” of the final states in electron-positron annihilations. They are constructed
from the momentum vectors, $\vec{p}_i$, of all the final state particles as follows:

$$H_\ell = \frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{+\ell} \left| \sum_i Y_\ell^m(\Omega_i)\, \frac{|\vec{p}_i|}{\sqrt{s}} \right|^2 \tag{3.10}$$

where the inner sum is over the particles produced, $\sqrt{s}$ is the total center-of-mass
energy and $Y_\ell^m$ are the spherical harmonics. Here one must choose a particular set of
axes to evaluate the angles, $\Omega_i = (\theta_i, \phi_i)$, of the final state particles, but the
values of the $H_\ell$ are independent of this choice. These moments lie in the range
$0 \le H_\ell \le 1$ and if energy is conserved in the final state then $H_0 = 1$ (neglecting
the masses). If momentum is conserved in the final state then $H_1 = 0$.

The Fox-Wolfram observables (or moments) constitute a complete set of shape
parameters. For example, the collinear “two-jet” final state results in $H_\ell \approx 1$ for
even $\ell$ and $H_\ell \approx 0$ for odd $\ell$. Events that are completely spherically symmetric
give $H_\ell = 0$ for all $\ell \neq 0$.


Modified Fox-Wolfram Moments Applied to Calorimeter Cells


In hadron-hadron collisions spherical symmetry is lost and we are interested
more in the shape of events in the transverse plane. For example, the Fox-Wolfram
moments when applied directly to hadron-hadron collisions would interpret a
minimum bias event as a “two-jet” event, whereas we would like to have a minimum
bias event treated more like a spherically symmetric $e^+e^-$ final state (i.e., no
structure). To accomplish this, we define the modified Fox-Wolfram moments for
hadron-hadron collisions as follows:

$$H_\ell(cell) = \frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{+\ell} \left| \sum_i^{cells} Y_\ell^m(\Omega_i)\, \frac{E_T^i}{E_T(sum)} \right|^2 \tag{3.11}$$

where the inner sum is over all the calorimeter cells in the event with transverse
energy, $E_T^i$, greater than some minimum (for example, 5 GeV) and $\Omega_i = (\theta_i, \phi_i)$ are
the angular locations of the center of the cell. In this case, $E_T(sum)$ is the total
transverse energy of all the cells that are included in the sum. The calorimeter
cells contain all the information concerning the topology of the event and it is not
necessary to define jets. These modified moments also lie in the range $0 \le H_\ell \le 1$
and by definition $H_0 = 1$.
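
A short sketch of how the modified moments of Eq. 3.11 could be evaluated from a list of cells is given below. Rather than summing spherical harmonics explicitly, it uses the addition theorem $\sum_m Y_\ell^m(\Omega_i)\,Y_\ell^{m*}(\Omega_j) = \frac{2\ell+1}{4\pi} P_\ell(\cos\gamma_{ij})$, which turns Eq. 3.11 into $H_\ell = \sum_{i,j} w_i w_j P_\ell(\cos\gamma_{ij})$ with $w_i = E_T^i/E_T(sum)$ and $\gamma_{ij}$ the opening angle between cells $i$ and $j$. This is a mathematically equivalent rewriting chosen for convenience, not necessarily how the moments were computed in this work; the function names and the cell format (arrays of $\eta$, $\phi$ and $E_T$) are assumptions.

```python
import numpy as np
from numpy.polynomial import legendre


def unit_vectors(eta, phi):
    """Unit direction vectors for cells located at (eta, phi), phi in radians."""
    theta = 2.0 * np.arctan(np.exp(-np.asarray(eta, dtype=float)))
    phi = np.asarray(phi, dtype=float)
    return np.stack(
        [np.sin(theta) * np.cos(phi), np.sin(theta) * np.sin(phi), np.cos(theta)],
        axis=1,
    )


def modified_fox_wolfram(eta, phi, e_t, l_max=6, e_t_min=0.0):
    """Return the array [H_0, ..., H_l_max] for cells with E_T above e_t_min.

    At least one cell must pass the E_T threshold, otherwise E_T(sum) is zero.
    """
    eta = np.asarray(eta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    e_t = np.asarray(e_t, dtype=float)
    keep = e_t > e_t_min
    n = unit_vectors(eta[keep], phi[keep])
    w = e_t[keep] / e_t[keep].sum()            # weights E_T^i / E_T(sum)
    cos_gamma = np.clip(n @ n.T, -1.0, 1.0)    # cosines of all pairwise opening angles
    moments = []
    for ell in range(l_max + 1):
        coeffs = np.zeros(ell + 1)
        coeffs[ell] = 1.0                      # selects the Legendre polynomial P_ell
        p_ell = legendre.legval(cos_gamma, coeffs)
        moments.append(float(w @ p_ell @ w))
    return np.array(moments)


# Two back-to-back cells of equal transverse energy at eta = 0.
h_two_jet = modified_fox_wolfram([0.0, 0.0], [0.0, np.pi], [50.0, 50.0])
```

For this back-to-back configuration the even moments come out equal to 1 and the odd moments equal to 0, which is the collinear “two-jet” pattern described for the original Fox-Wolfram observables.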


Modified Fox-Wolfram Moments Applied to Jets


Instead of using the calorimeter cells directly to characterize the event topology
one can define jets and use them to construct modified Fox-Wolfram moments. Of
course, one should be aware that the jet definition can potentially result in a loss
of information and decrease the discrimination between signal and background.

We define jets using a simple algorithm. One first considers the “hot” cells
(those with transverse energy greater than, say, 5 GeV). Cells are combined to
form a jet if they lie within a specified “radius” $R_j^2 = \Delta\eta^2 + \Delta\phi^2$ in $\eta$-$\phi$ space from
each other. Jets have an energy given by the sum of the energy of each cell in the
cluster and a momentum $\vec{p}_j$ given by the vector sum of the momenta of each cell.
The invariant mass of a jet is simply $M_j^2 = E_j^2 - \vec{p}_j \cdot \vec{p}_j$. For example, in Chapter 8
we will examine both “narrow”, $R_j = 0.4$, and “fat”, $R_j = 0.7$, jets, where jets are
required to have at least a certain amount of transverse energy, $E_T^{min}$, usually about
15 GeV.
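
The text does not spell out the clustering procedure in detail, so the following is only one possible reading of it, written as a minimal sketch. It assumes that the hottest unassigned cell seeds a jet, that all unassigned hot cells within the radius of the seed are merged into it, that each cell is treated as a massless particle pointing at the cell center, and that the jet transverse energy is the scalar sum of the cell transverse energies. All names (`find_jets`, `cell_et_min`, `jet_et_min`) are illustrative.

```python
import math


def delta_phi(a, b):
    """Smallest absolute difference between two azimuthal angles (radians)."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def cell_four_vector(eta, phi, energy):
    """(E, px, py, pz) of a cell treated as a massless particle."""
    theta = 2.0 * math.atan(math.exp(-eta))
    return (
        energy,
        energy * math.sin(theta) * math.cos(phi),
        energy * math.sin(theta) * math.sin(phi),
        energy * math.cos(theta),
    )


def find_jets(cells, radius=0.7, cell_et_min=5.0, jet_et_min=15.0):
    """Cluster hot cells into jets. cells is a list of (eta, phi, energy)."""
    hot = []
    for eta, phi, energy in cells:
        e_t = energy * math.sin(2.0 * math.atan(math.exp(-eta)))
        if e_t > cell_et_min:
            hot.append((e_t, eta, phi, energy))
    hot.sort(reverse=True)  # hottest cells act as seeds first
    used = [False] * len(hot)
    jets = []
    for i, (_, eta_seed, phi_seed, _) in enumerate(hot):
        if used[i]:
            continue
        # All unassigned hot cells within the radius of the seed (seed included).
        members = [
            j for j in range(len(hot))
            if not used[j]
            and math.hypot(hot[j][1] - eta_seed, delta_phi(hot[j][2], phi_seed)) <= radius
        ]
        for j in members:
            used[j] = True
        vectors = [cell_four_vector(hot[j][1], hot[j][2], hot[j][3]) for j in members]
        e, px, py, pz = (sum(component) for component in zip(*vectors))
        jet_et = sum(hot[j][0] for j in members)
        if jet_et >= jet_et_min:
            mass = math.sqrt(max(e * e - (px * px + py * py + pz * pz), 0.0))
            jets.append({"E": e, "p": (px, py, pz), "E_T": jet_et,
                         "mass": mass, "n_cells": len(members)})
    return jets
```

Running the same cells through `find_jets` with two different radii, as the text recommends, makes the dependence of the multiplicity on the jet definition directly visible.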

The modified Fox-Wolfram moments are constructed from jets as follows:

$$H_\ell(jets) = \frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{+\ell} \left| \sum_i^{jets} Y_\ell^m(\Omega_i)\, \frac{E_T^i}{E_T(sum)} \right|^2 \tag{3.12}$$

where the inner sum is now over all the jets in the event with transverse energy,
$E_T^i$, greater than $E_T^{min}$ and $\Omega_i = (\theta_i, \phi_i)$ are the angular locations of the jets. Here,
$E_T(sum)$ is the sum of the transverse energy of all the jets that are included in
the sum.

One can use the modified Fox-Wolfram moments constructed either from the
cells or from the jets. In either case the $H_\ell$'s characterize the topology of the event.
The usage of cells, however, is preferable since it obviates the need for defining jets.
In Chapter 8 we will show that using modified Fox-Wolfram moments with cells
gives at least as good results, or better, compared to using jets.

One can usually discriminate between signal and background quite well by just
making a simple cut on $H_\ell(max)$. But one can do much better by considering the
first several moments. We usually consider the first six moments, $H_1, \ldots, H_6$, which
form a six dimensional space. Different regions of that space correspond to different
event topologies. Signal enhancement can be achieved by selecting the region (or
regions) in this event space that are mostly “populated” by signal events. This
selection of regions (see next Section) can be done either by using multi-dimensional
linear cuts, Fisher discriminates, or by applying a neural network directly to the
modified Fox-Wolfram coefficients, $H_\ell$ [7].

3.2  Methods  For  Signal  Enhancement 

As already discussed, the purpose of data analysis in high energy physics is
to enhance the signal over the background while at the same time preserving
enough of the signal events. In most cases, the number of signal events for a decay
process is orders of magnitude smaller than the number of background events. As
a consequence, it is a very challenging task to define a set of selection criteria that
discards most of the background events while retaining a large percentage of the
signal.

In this section, we concentrate on the most popular methods for signal
enhancement, linear cuts and Fisher discriminates, as well as on the emerging application
of neural networks in high energy physics data analysis. In Chapters 6, 7 and
8 we will apply all three methods and demonstrate their relative advantages and
disadvantages.

3.2.1  Linear  Cuts 

The most common way to enhance the signal for a process is to apply a set of
cuts to various variables that describe signal and background events or to variables
that are derived from the “raw” data. For example, if the signal events have a
leptonic component that consists of electrons or muons, it is very reasonable to
demand that the event contain at least one isolated high transverse momentum
charged lepton, i.e. that

$$P_T(\ell^\pm) > P_T^{min} \tag{3.13}$$

where $P_T(\ell^\pm)$ is the transverse momentum of an electron or muon and $P_T^{min}$
is the minimum value necessary to consider an event as a candidate for signal.

Another very common cut is to demand that the total transverse energy,
$E_T(sum)$, exceed a certain threshold value. For example, in our analysis of
top-quark signal enhancement for the four-jet decay mode of top we will demand that
$E_T(sum) > 100$ GeV. This criterion reflects the fact that events with low total
transverse energy are not likely to be signal events since the mass of the top is
about 175 GeV and, therefore, the total energy of the event (including leptonic and
missing energy from neutrinos) must be at least 350 GeV which is the minimum
energy for top-antitop pair production. Cuts on the minimum total transverse
energy are very powerful discriminators between signal and background. However,
they can cause the mass distribution of the background events to peak near or at
the invariant mass of the signal events.

Selection cuts are typically applied in sequence. In other words, a set of criteria
is applied sequentially to the data and each one of them improves the enhancement
and reduces the efficiency. It must be stressed, however, that sequential application
of linear cuts is in general not equivalent to the simultaneous application of all
cuts. As a consequence, in addition to the assumptions behind each
selection cut, the sequential application can give inferior results compared to
considering all (or at least a subset of) cuts at the same time, i.e. performing
multi-dimensional linear cuts.

Multi-dimensional  linear  cuts  are  difficult  to  use  for  at  least  two  reasons: 

1. Finding the optimal region is difficult to do when using local minimization
algorithms due to local minima in the multi-dimensional parameter space
of cuts.

2.  It  is  not  always  possible  to  even  use  local  algorithms  since  the  performance 
measure  with  respect  to  the  parameters  may  be  zero  almost  everywhere. 
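
One way to read the second difficulty is that any measure built from counts of surviving events is a step function of the cut values, so a small change in a cut usually changes nothing and a gradient-based local algorithm receives no information. The sketch below illustrates this reading with a single cut on a toy sample; the function name `surviving_fraction` and the numbers are purely illustrative.

```python
import numpy as np


def surviving_fraction(values, cut):
    """Fraction of events whose variable exceeds the cut value."""
    return float(np.mean(np.asarray(values, dtype=float) > cut))


# Toy sample of one variable for four events.
sample = [1.0, 2.0, 3.0, 4.0]

# A small shift of the cut leaves the surviving fraction unchanged, so a
# finite-difference estimate of the derivative is exactly zero here.
before = surviving_fraction(sample, 2.5)
after = surviving_fraction(sample, 2.5 + 1e-6)
finite_difference = (after - before) / 1e-6
```

The measure only changes when the cut crosses one of the event values, which is why search methods that do not rely on derivatives are attractive for this problem.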

Both of these problems can be overcome by using global minimization algorithms.
In Chapter 5 we will present a practical universal genetic algorithm and show how
it is applied to multi-dimensional linear cuts. In Chapter 8, when we consider the
six-jet decay mode of top, we will apply multi-dimensional linear cuts on the modified
Fox-Wolfram coefficients with the statistical significance measure, $R_f$.


3.2.2  Fisher  Discriminates 


Another method of separating signal and background is to use Fisher
discriminates [8]. Let $N$ be the number of variables that describe an event. For example,
these variables could be the total transverse energy in the calorimeter cells, the
transverse momenta of leptons, missing transverse energy from neutrinos, etc. Let
$x_i$ denote these $N$ variables. Then we can define the linear discriminant function,
$F(x)$, as follows

$$F(x) = \sum_{i=1}^{N} \alpha_i x_i \tag{3.14}$$

where $\alpha_i$ are the so called Fisher coefficients. Let us now define the Fisher criterion,
$C_F \in \mathbb{R}$, as

$$C_F = \frac{\left(\mu_F^{sig} - \mu_F^{bak}\right)^2}{\left(\sigma_F^{sig}\right)^2 + \left(\sigma_F^{bak}\right)^2} \tag{3.15}$$

where $\mu_F^{sig}$ and $\mu_F^{bak}$ are the mean of the signal and background distributions of $F$
and $\sigma_F^{sig}$ and $\sigma_F^{bak}$ are their respective standard deviations.

The essence of Fisher discriminates is to find the Fisher coefficients, $\alpha_i$, which
maximize the Fisher criterion, $C_F$, i.e.

$$\frac{\partial C_F}{\partial \alpha_i} = 0 \tag{3.16}$$

for all $i = 1, \ldots, N$. The solution of this equation is

$$\alpha_i = \sum_{j=1}^{N} \left(V^{sig} + V^{bak}\right)^{-1}_{ij} \left(\mu_j^{sig} - \mu_j^{bak}\right) \tag{3.17}$$

where $\mu_j^{sig}$ and $\mu_j^{bak}$ are the mean of the signal and the background with respect
to the $j$th variable, i.e.

$$\mu_j^{sig} = \frac{1}{N_{sig}} \sum_{n=1}^{N_{sig}} x_j(n) \tag{3.18}$$

and

$$\mu_j^{bak} = \frac{1}{N_{bak}} \sum_{n=1}^{N_{bak}} x_j(n) \tag{3.19}$$

where $N_{sig}$ and $N_{bak}$ are the number of signal and background events respectively.
The matrices $V^{sig}$ and $V^{bak}$ are the respective covariance matrices given by

$$V_{ij}^{sig} = \frac{1}{N_{sig}} \sum_{n=1}^{N_{sig}} \left(x_i(n) - \mu_i^{sig}\right)\left(x_j(n) - \mu_j^{sig}\right) \tag{3.20}$$

and

$$V_{ij}^{bak} = \frac{1}{N_{bak}} \sum_{n=1}^{N_{bak}} \left(x_i(n) - \mu_i^{bak}\right)\left(x_j(n) - \mu_j^{bak}\right). \tag{3.21}$$

Therefore, finding the Fisher coefficients, $\alpha_i$, involves the calculation of the two
covariance matrices, $V$, and the inversion of the matrix $(V^{sig} + V^{bak})$. The
simplicity and the speed with which these calculations can be performed are a major
advantage for using Fisher discriminates.
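
The calculation described above takes only a few lines with a linear algebra library. The following is a minimal sketch using NumPy, assuming the signal and background samples are stored as arrays with one row per event and one column per variable; the function names are illustrative. A linear solve is used in place of an explicit matrix inversion, which gives the same coefficients.

```python
import numpy as np


def fisher_coefficients(x_sig, x_bak):
    """Fisher coefficients alpha = (V_sig + V_bak)^(-1) (mu_sig - mu_bak).

    x_sig, x_bak: arrays of shape (number of events, number of variables).
    Covariances are normalized by the number of events (bias=True).
    """
    mu_sig = x_sig.mean(axis=0)
    mu_bak = x_bak.mean(axis=0)
    v_sig = np.atleast_2d(np.cov(x_sig, rowvar=False, bias=True))
    v_bak = np.atleast_2d(np.cov(x_bak, rowvar=False, bias=True))
    return np.linalg.solve(v_sig + v_bak, mu_sig - mu_bak)


def fisher_criterion(alpha, x_sig, x_bak):
    """Squared separation of the means of F over the sum of the variances of F."""
    f_sig = x_sig @ alpha
    f_bak = x_bak @ alpha
    return (f_sig.mean() - f_bak.mean()) ** 2 / (f_sig.var() + f_bak.var())


# Toy data: three variables, Gaussian signal and background samples.
rng = np.random.default_rng(0)
toy_sig = rng.normal(loc=[1.0, 0.0, 0.5], scale=[1.0, 2.0, 1.0], size=(400, 3))
toy_bak = rng.normal(loc=[0.0, 0.5, 0.0], scale=[1.5, 1.0, 2.0], size=(600, 3))
alpha = fisher_coefficients(toy_sig, toy_bak)
c_best = fisher_criterion(alpha, toy_sig, toy_bak)
```

Because the criterion is unchanged when all coefficients are multiplied by a common factor, only the direction of `alpha` matters; any other direction in the space of variables gives a criterion no larger than `c_best` on the same samples.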

Unfortunately, Fisher discriminates do not give as good results as neural
networks, for example. In most cases when we have applied both Fisher discriminates
and neural networks for the same problem, Fisher discriminates have produced
an inferior result. This is not very surprising since Fisher discriminates imply a
number of assumptions which are in most cases not justified. They are as follows:

1. The discriminant function, $F(x)$, is linear. This is clearly a very particular
case and one cannot expect that this simple function will perform well in
general.

2. The Fisher criterion, $C_F$, is defined in terms of the mean and the standard
deviation of the signal and background distributions. This implies the
assumption that these distributions are normal. While this is often the case for
signal distributions (although by no means always), background distributions
are closer to Poisson distributions^2.

3. The Fisher coefficients, $\alpha_i$, maximize the Fisher criterion, $C_F$, but they
generally do not maximize the statistical significance measure, $R_f$. Fisher
discriminates maximize the “distance” between the signal and background
distributions which are presumed normal. On the other hand, the statistical
significance measure, $R_f$, takes into account that the preservation of signal
events has more weight than the reduction of background events.

In Chapter 6 we will show that neural networks are a superior tool compared to
Fisher discriminates. Nevertheless, Fisher discriminates, despite the disadvantages
outlined above, are useful even when using neural networks or other superior tools.
At the very least, Fisher discriminates can be used to establish a lower limit for
signal enhancement and, therefore, serve as a consistency check for other methods.

^2 This is especially true if we have not performed a cut on the total transverse energy as
discussed earlier.


66 


3.2.3  Neural  Networks 


Another method for signal enhancement is to use neural networks. This method
is relatively new in the area of high energy physics, but is gaining more popularity
due to the superior results it can produce compared to other, more traditional,
methods.

A neural network can be viewed as a complicated non-linear discriminant
function which is composed of a large number of simpler non-linear functions. As
before, let there be $N$ variables describing an event and let us denote these
variables as $x_i$. Thus each event is a point, $x$, in the space $X \subset \mathbb{R}^N$ and $x_i$ are the
projections of $x \in X$. Let $N(x)$ be a function that represents an entire neural
network. This is analogous to the Fisher discriminant function, $F(x)$, except that
$N(x)$ is non-linear. In addition, $N(x)$ is composed of a large number of simpler
non-linear functions, $f(x)$, that represent a neuron as shown in Fig. 3.6. This
figure shows the schematic structure of a neural network with 8 neurons and two
input variables describing an event. The output of the neural network, $N(x)$, is
then used in conjunction with some performance measure. In particular, one could
use the output from the neural network as a reduced representation of the collider
events. The multi-dimensional input space $X$ is reduced to the single-dimensional
output space $Y = N(X)$.

Figure 3.6: Schematic representation of a neural network with 2 input variables,
8 neurons and one single output. Each neuron is represented by the non-linear
function $f(x)$. The output, $N(x)$, as a function of the input variables $x \in \mathbb{R}^N$ can
be viewed as a composite non-linear discriminant function.

The most common form of the neuronal function $f(x)$ is

$$f(x) = S\left(t + \sum_{i=1}^{N_{in}} w_i x_i\right) \tag{3.22}$$

where $S$ is some non-linear activation function, $N_{in}$ is the number of inputs to the
particular neuron, $w_i$ are neuronal coefficients called synaptic weights, $t$ is a free
coefficient called the threshold and $x_i$ are the $N_{in}$ inputs to the neuron.

As can be seen from Eq. 3.22, if the neural network has a single neuron and
the activation function, $S$, is the identity function and the threshold value, $t$, is
zero then the network function $N(x)$ would be identical to the Fisher discriminant
function $F(x)$. In that case, the synaptic weights, $w_i$, would play the role of the
Fisher coefficients, $\alpha_i$. Therefore, neural networks can be viewed as a generalization
of the Fisher discriminant function, $F(x)$. This is the mathematical basis for the
superior performance of neural networks as compared to Fisher discriminates as
will be demonstrated in Chapter 6.
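
The reduction of a single neuron to the Fisher discriminant function can be illustrated in a few lines. This is a minimal sketch of Eq. 3.22; the hyperbolic tangent is used only as an example of a non-linear activation (the text does not fix a particular choice of $S$ here), and the function names are illustrative.

```python
import numpy as np


def identity(z):
    """Identity activation, S(z) = z."""
    return z


def neuron(x, weights, threshold=0.0, activation=np.tanh):
    """Neuronal function f(x) = S(t + sum_i w_i x_i) of Eq. 3.22."""
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return activation(threshold + weights @ x)


def fisher_function(x, alpha):
    """Linear discriminant F(x) = sum_i alpha_i x_i of Eq. 3.14."""
    return float(np.dot(alpha, x))


# Illustrative event and coefficients.
event = [1.5, -2.0, 0.5]
coefficients = [0.2, 0.1, -0.4]

# With the identity activation and zero threshold the neuron equals F(x).
linear_output = neuron(event, coefficients, threshold=0.0, activation=identity)
nonlinear_output = neuron(event, coefficients, threshold=0.1)
```

Replacing the identity by a non-linear activation and combining several such neurons is what gives the network function $N(x)$ its additional flexibility over $F(x)$.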

In  the  next  chapter,  we  will  present  neural  networks  in  more  detail  and  address 
specific  issues  related  to  their  application  to  high  energy  physics. 


CHAPTER  4 
NEURAL  NETWORKS 


Neural networks model the functionality of the nervous system of an individual.
They map input stimuli into output responses in a robust fashion. From a
structural  point  of  view,  the  robustness  is  achieved  by  a distributed  network  of 
neurons.  From  a functional  point  of  view,  robustness  is  achieved  by  the  ability 
of  such  networks  to  recognize  patterns  that  are  not  defined.  The  latter  feature  is 
extremely  important  since  in  most  non-trivial  applications  it  is  either  impossible 
or  impractical  to  classify  the  input  stimuli. 

High  energy  physics  data  from  hadron  colliders  cannot  be  easily  classified  into 
well  defined  patterns.  It  is  very  common  for  signal  events  to  resemble  background 
events  and,  as  a consequence,  the  separation  between  signal  and  background  is  a 
very  challenging  task.  As  will  be  shown  in  Chapter  6,  neural  networks  can  give 
better  results  than  traditional  techniques.  We  will  use  the  hadron  collider  data  as 
input  stimuli,  and  the  response  from  a neural  network  will  represent  the  degree  of 
separation  between  signal  and  background  events. 

In  this  chapter  we  will  address  the  following  issues: 

1.  We  will  investigate  the  machine  intelligence  aspect  of  neural  networks  from 
several  different  points  of  view. 

2.  We  will  investigate  various  types  of  neural  networks  and  their  domain  of 
applicability. 


3.  We  will  present  various  training  methods  along  with  their  relative  advantages 
and  disadvantages. 

Since  it  is  beyond  the  scope  of  this  work  to  present  neural  network  theory  in 
detail,  we  only  concentrate  on  the  issues  that  are  directly  relevant  to  the  high 
energy  physics  applications  presented  in  Chapters  6,  7,  8 and  9. 

4.1  Neural  Networks  and  Machine  Intelligence 

Machine  intelligence  is  a broad  scientific  and  engineering  discipline  that  deals 
with  the  study  of  adaptive  systems  and  their  implementation  in  machines.  In  1948, 
Norbert  Wiener  called  it  cybernetics.  Since  neural  networks  are  adaptive  systems, 
neural  network  theory  is  a branch  of  cybernetics. 

The study of adaptive systems, however, is much broader. Many scientific
and engineering disciplines study the properties of adaptive systems from different
points of view. For example, mathematicians study them as function estimation
and dynamical systems. Economists study them through game theory and utility
maximization^. Computer scientists study them from the point of view of algorithms
and the theory of automata. Engineers call the subject adaptive control. Biologists
address the same issues in neuroscience and population biology. As can be seen, the
study of adaptive systems is a vast interdisciplinary field.

In this section we emphasize the biological aspects of neural networks, the
mathematical aspects of function estimation and dynamical systems, as well as the
difference between artificial intelligence approaches and neural networks. Without
at least a basic understanding of when neural networks are applicable, one cannot
expect to successfully construct and apply neural networks to high energy physics
problems.

^In economics, the utility of a product is a measure of the “usefulness” of that product relative to
another product. Utility maximization is postulated to be the driving force of consumer choices.
It explains, among other things, why most demand curves have negative slope.

4.1.1  Neural  Networks  as  a Model  of  a Nervous  System 

Neural  networks  constitute  a computer  model  of  a nervous  system.  The  basic 
motivation for such modeling is the ability of an individual with a nervous system
to adapt to a wide variety of input stimuli and to produce meaningful and
robust  responses.  In  our  applications,  the  input  stimuli  correspond  to  high  energy 
hadron  collider  data  and  the  responses  are  the  recognition  of  an  event  as  signal  or 
background. 

Most higher multi-cellular organisms have a nervous system. Apparently nature
has discovered that “hard-wiring” of behavior is not beneficial despite its
simplicity^. Instead, natural evolution has found it more beneficial
to “hard-wire” the development of a nervous system, which in turn can deal
with changing conditions in a more robust and adaptive fashion. Nature has had
about 4 billion years to experiment with various possibilities. The fact that the
nervous systems of all species are essentially the same is an additional argument
for investigating biological neural networks in more detail and borrowing that design
from nature.

^What we mean by “hard-wiring” is the existence of genes that directly produce a behavior.
For example, the E. coli bacterium has 3 genes that are activated whenever there is lactose in the
environment, but no glucose. Both lactose and glucose are energy sources, but glucose is the one
the cell metabolizes more readily. These 3 genes produce a behavior, but it is completely
“hard-wired” in the genetic code of E. coli.




The general structure of a nervous system for all species is shown in Fig. 4.1.
The stimulus, or the sensory input, is initiated by specialized cells (for example, the
cells in the retina of the eye). Those cells are connected to sensory neurons which
transmit the input information into the network of interneurons. The interneurons,
also called integrators, activate some of the motor neurons. The motor neurons,
in turn, are connected to specialized cells (such as muscle cells)
that carry out the response action. The number of interneurons is many orders of
magnitude larger than the number of sensory and motor neurons. In humans, most of the
interneurons are in the spinal cord and the brain. There are species, however, that
do not have such structures (worms, for example).

Figure 4.1: The general structure of a nervous system. The sensory neurons receive
their input from specialized sensory cells. The sensory neurons transmit the
information to a large network of interneurons. Most of the interneurons are in
the spinal cord and the brain, for species that have such structures at all. The
interneurons send their output to the motor neurons which activate specialized
cells to carry out the response action.

The most important feature of the structure of the nervous system is the
existence of the 3 layers. The general structure of a neural network is also layered,
as shown in Fig. 4.2. Its layers are called the input layer, which corresponds to the
sensory neurons, the hidden layers, which correspond to the interneurons, and the
output layer, which corresponds to the motor neurons. For that reason, it is not
accurate to say that neural networks are a model of the brain (as many authors
do). The brain, if present at all, is only one component of the nervous system.


[Figure 4.2 diagram: input (stimulus) -> input layer -> hidden layers -> output layer -> output (response)]

Figure  4.2:  The  general  structure  of  a neural  network.  The  input  layer  receives 
(or  senses)  the  data.  This  information  is  then  transmitted  to  the  hidden  layers. 
Some  of  the  neurons  in  the  hidden  layers  are  connected  to  the  output  layer.  The 
meaning  of  the  response  from  the  output  layer  depends  on  the  objective  of  the 
problem  being  solved. 

Neurons  are  specialized  cells  that  are  the  building  blocks  of  a nervous  system. 
The  neurons  of  all  species  are  structurally  and  functionally  the  same.  This  is 
an  astonishing  fact  and  it  stands  to  reason  that  the  properties  of  neurons  are 
sufficiently  flexible  to  accommodate  nervous  systems  as  simple  as  those  of  worms 
and  as  complex  as  those  of  humans.  Since  the  properties  of  neurons  influence  the 
functioning  of  a neural  network,  we  first  investigate  biological  neurons  and  the  way 
in  which  they  function. 

The general structure of a biological neuron is shown in Fig. 4.3. The most
important  structural  feature  is  the  fact  that  there  are  many  dendrites  which  receive 
information  from  other  cells,  but  only  one  axon  which  is  the  output  of  the  neuron. 
This  is  certainly  not  the  most  general  structure  possible.  For  example,  there  could 
be  many  axons  or  the  information  flow  could  be  bi-directional  in  which  case  there 
would  be  no  need  to  distinguish  between  dendrites  and  axons.  It  is  arguable 
that  this  structure  is  the  simplest  possible.  In  neural  networks  this  structure  is 
preserved. 

The functionality of an individual neuron is also fairly simple. Each dendrite
is in close proximity to either a receptor cell or an axon ending of another neuron.
In both cases, the membranes of the two cells are in close proximity. Such junctions
are called synapses. Each synapse effectively “conducts” the electro-chemical
signal from the transmitting cell to the receiving cell. Each signal is either
excitatory or inhibitory. The signals received by the dendrites travel to
the body of the cell where they “compete” for control. This competition is additive
in nature since the signal is the potential difference between the inside and the
outside of the membrane. This potential difference is determined by the relative
concentrations of $K^+$ and $Na^+$ ions inside and outside of the membrane. Under
normal circumstances, the neuron cell is negative inside and positive outside. The
voltage difference (the cathode being outside) is $-70$ mV. If the sum of the excitatory
and inhibitory signals from all dendrites exceeds the threshold value of approximately
$-40$ mV, a positive feedback is initiated at the base of the axon. This results in
the potential rising to about $+30$ mV and this disturbance is propagated along
the axon away from the cell. During that time, the neuron cannot respond to new
stimuli.

Figure 4.3: The general structure of a neuron cell. The dendrites are short
protrusions from the cell which receive information from receptor cells or other
neurons. The axon is a long protrusion from the cell which carries the information
toward the axon endings. The axon endings transmit the information to
other neurons or to effector cells (for example, muscle cells).

A thorough  description  of  the  electro-chemical  processes  that  take  place  in  a 
neuron  is  beyond  the  scope  of  this  work.  In  addition,  we  are  only  interested  in 
the  functional  aspects  of  a biological  neuron  since  that  is  what  we  use  to  model 
artificial  neurons.  Therefore,  from  this  functional  point  of  view,  we  can  summarize 
the  functionality  of  a biological  neuron  as  follows: 

1. The inputs for any neuron are other cells in close proximity to the dendrites
of the neuron.

2.  The  junctions  between  the  two  cells,  the  synapses,  determine  the  relative 
strength  of  one  input  vs.  another. 

3.  The  “competition”  between  excitatory  and  inhibitory  signals  coming  from 
the  dendrites  is  additive  in  nature.  Furthermore,  it  is  naturally  represented 
by positive numbers for excitatory signals and negative numbers for inhibitory
signals.

4. Due to the positive feedback that occurs if the threshold value of about
$-40$ mV is reached, the axon excitation is an all-or-nothing event. This fact
has  far  reaching  consequences  when  we  start  modeling  a neuron.  In  Section 
4.3  we  will  outline  the  consequences  of  deviating  from  the  biological  analog 
of  a neuron. 

5. During a positive feedback that results in the propagation of the signal along
the axon, the neuron cannot respond to new changes in the dendrite potentials.
This has two consequences: first, biological nervous systems are
not prone to sustained positive feedback; second, there is a natural way to
simulate the action of a neuron since the “blackout time” provides a natural
time quantum which is necessary for simulation purposes^.


Based  on  the  exposition  above,  we  can  postulate  the  following  correspondence 
principles: 

1. Each dendrite is assigned a weight that measures the relative “importance”
of the individual cell-to-dendrite connection. For artificial neurons we will
call it a synaptic weight and represent it by a real number.
Positive numbers reflect excitatory inputs and negative numbers reflect
inhibitory inputs.

2. Each neuron cell adds all incoming signals from the dendrites and compares
the sum with the threshold value. If a certain (potential difference) level is reached,
the neuron is activated. We will model the threshold value as a real valued
parameter of a neuron.

3.  The  positive  feedback  that  takes  place  at  the  base  of  the  axon  is  modeled  as 
an  activation  function.  The  simplest  activation  function  is  the  step  function^. 

^There is an implicit assumption that the neuron simulation is digital in nature. There is
also an implicit assumption that the “blackout” time frame is the same under all circumstances.
Both assumptions can cause artificial neural networks to deviate from the functionality of their
biological counterparts. An analog implementation of a neural network might be the simplest
way to model neuron cells accurately. This is especially true when one takes into account
the fact that real neural networks are completely asynchronous. The activation of real neurons
depends on the time integration of the input potentials, not just on a simple sum. It is likely
that the simplifications implicit in the currently used artificial neural networks are limiting their
potential usefulness.

^We will see later that a step function cannot be used with local training algorithms.

A model of a neuron is shown in Fig. 4.4. The inputs represent the signals before
a synapse. Each input is weighted by a synaptic weight $w_i$. These weighted inputs
represent the signal strengths inside a neuron. They are all added together along
with the threshold value $t$. The result is the argument of an activation function and
models the axon excitation in a neuron cell. The value of the activation function
represents the output of a neuron. The simplest activation function is the step
function, which would imply an all-or-nothing event as is the case with neuron
cells. In that case, if the result from the addition is greater than zero, the neuron
would be activated; otherwise it would not.

[Figure 4.4 diagram: Input 1, Input 2, Input 3 -> neuron -> Output]

Figure 4.4: The general structure of an artificial neuron. The dendrites are
the inputs to the neuron. The synaptic weights are the real numbers $w_1$, $w_2$ and
$w_3$ which represent the relative strength of an input signal. The threshold, $t$, is added
to the weighted strengths of all inputs and the result is fed to an activation function.
The result from the activation function models the axon activation.

Let $x_i$ denote the $i$th input to a neuron. Let the synaptic weight for that input
be denoted by $w_i$. Let $t$ be the threshold value of the neuron. Then if $A(x)$ is an
activation function, the output of the neuron, $y$, is the following function:

$$y = A\left(t + \sum_{i=1}^{n} w_i x_i\right) \qquad (4.1)$$
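
As an illustration of the neuron output function above, the following Python
sketch implements a single artificial neuron with the step activation. It is a
minimal sketch, not the implementation used in this work: the names `step` and
`neuron_output` and all numerical values are assumptions made for the example.

```python
def step(s):
    """All-or-nothing activation: 1 if the argument is positive, otherwise 0."""
    return 1.0 if s > 0.0 else 0.0


def neuron_output(inputs, weights, threshold, activation=step):
    """Compute y = A(t + sum_i w_i * x_i) for one artificial neuron."""
    if len(inputs) != len(weights):
        raise ValueError("one synaptic weight per input is required")
    total = threshold + sum(w * x for w, x in zip(weights, inputs))
    return activation(total)


# Illustrative values only: two excitatory inputs (positive weights)
# and one inhibitory input (negative weight).
weights = [0.8, 0.6, -1.0]
threshold = -0.5

# Both excitatory inputs active: the argument is about +0.9, so the neuron fires.
fires = neuron_output([1.0, 1.0, 0.0], weights, threshold)

# One excitatory and the inhibitory input active: the argument is about -0.7.
silent = neuron_output([1.0, 0.0, 1.0], weights, threshold)
```

Any other activation function can be passed through the `activation` argument
without changing the rest of the sketch.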


To construct a neural network, we must interconnect many neurons.
For example, Fig. 4.5 shows a neural network with 3 layers. The first layer
contains “sensory” neurons, i.e. the inputs to those neurons are external to the
network. In our case, they represent high energy data for one event. The second
layer contains interneurons and gets its inputs from the “sensory” neurons. It is
the analog of a brain. The last layer contains “motor” neurons. The outputs from
these neurons leave the network. They represent the response action which, in
our case, indicates whether an event is signal or background. The “smarter”
a neural network is, the more accurate its responses are. In Section 4.3 we will
present different ways to train a neural network.

Figure 4.5: Neural network with 3 layers. The first layer has “sensory” neurons
which get their inputs from external “stimuli”. The second layer has interneurons
which are analogous to a brain, for example. The last layer contains one “motor”
neuron. Its output represents the response corresponding to each stimulus on the
input.
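
The layered structure described above can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not the network used in later
chapters: the sigmoid activation, the layer sizes and every numerical parameter
are hypothetical choices made only so that the example runs.

```python
import math


def sigmoid(s):
    """A smooth activation function (an assumption; the text so far names only the step)."""
    return 1.0 / (1.0 + math.exp(-s))


def layer_output(inputs, weights, thresholds, activation=sigmoid):
    """Each neuron in the layer computes A(t + sum_i w_i * x_i) on the same inputs."""
    return [activation(t + sum(w * x for w, x in zip(row, inputs)))
            for row, t in zip(weights, thresholds)]


def network_output(event, layers):
    """Feed one event through the layers in order and return the final response."""
    signal = event
    for weights, thresholds in layers:
        signal = layer_output(signal, weights, thresholds)
    return signal


# Hypothetical parameters: (weight rows, thresholds) for each layer.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),                       # 2 "sensory" neurons
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.1, -0.2, 0.0]),   # 3 interneurons
    ([[1.0, -1.0, 0.5]], [0.0]),                                  # 1 "motor" neuron
]

# Two illustrative input variables describing one event.
response = network_output([0.3, 0.7], layers)
```

The single number in `response` lies between 0 and 1 and plays the role of the
network response for that event; all three layers use the same neuron function
and differ only in where their inputs come from.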

It should be noted that all neurons are structurally and functionally the same
regardless of whether they are sensory neurons, interneurons or motor neurons. This
classification refers only to the source of the input signals or the destination of the
response. As pointed out earlier, the same holds true for neural cells. Apart from
the difference in axon length, the three types of neurons are essentially the same^.
For that reason, we will simply use the term neuron to refer to any artificial neuron
regardless of its role in the network.

4.1.2  Neural  Networks  as  Model-Free  Function  Estimators 

From a mathematical point of view, neural networks are model-free estimators
of numerical input-output functions. Let $X$ be the input domain and $Y$ the output
domain of a neural network. In general, $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}^m$ where $n$ is the
number of inputs to the network and $m$ the number of outputs. Let

$$N : X \to Y \qquad (4.2)$$

be the action of a neural network. Then the function $N(x)$ is a model-free estimator
of the input-output (or stimulus-response) relationship. It is model-free since there
are no specific assumptions about the form of $N(x)$. Fisher discriminants, for
example, are not model-free estimators.

The action of a neural network, and thus the function $N(x)$, depends on its
topology, the neuron activation function, the values of all thresholds and the values
of all synaptic weights. We usually fix the topology and the activation function^
and then train the network to obtain the “best” values for thresholds and synaptic
weights. Thus we can think of the function $N$ as defined on the space $X \times T_h \times W$
where $T_h$ and $W$ are the spaces of thresholds and weights respectively. The space
$T_h \times W$ is the parameter space of the neural network. We can then rewrite Eq. 4.2 as:

$$N(t, w) : X \to Y \qquad (4.3)$$

where $t \in T_h$ and $w \in W$ are the parameters of the neural network function
$N$. Different points in the $T_h \times W$ space result in different maps $N(t, w)$. By
selecting the “best” point in $T_h \times W$, we can achieve the “best” approximation of
the input-output relationship $X \to Y$.

^Sensory and motor neurons usually have very long axons (sometimes in excess of 1 m for
humans) since the information has to be carried from the point of origin to the spine and the
brain for sensory neurons and vice versa for motor neurons. In contrast, interneurons in the brain
have much shorter axons (less than 1 mm) since they are usually connected to their immediate
neighbors.

^The genetic algorithms described in the next chapter can be used to vary both the topology
and the neuron activation function dynamically.

In most applications of neural networks, the map $X \to Y$ is piecewise continuous
since the inputs are usually real valued variables representing some physical
process. For example, one of the inputs might be the transverse energy, $E_T$, of
a jet. If $E_T$ changes infinitesimally while all other parameters stay the same, it
would be reasonable to expect that the two events are either both signal or
both background. Therefore, it is important that neural networks be able to
approximate piecewise continuous functions. This was established in [9]
for neural networks with two or more hidden layers. As a consequence, the usage
of neural networks is (at least in principle) on solid mathematical ground. It
should be pointed out, however, that the proof in [9] requires continuous activation
functions for the neurons. As will be discussed in the next section, this is not
always desirable. Even in the case of a non-continuous activation function, we can
still approximate continuous maps $X \to Y$; we just cannot do it with arbitrary
precision. The latter is rarely of importance in real problems.
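
The limited precision obtained with a non-continuous activation function can be
illustrated with a small Python sketch. It is an assumption-laden toy example,
not a construction taken from [9]: a single input, one hidden layer of step
neurons, a linear output sum, and the target $x^2$ on $[0, 1]$ are all choices
made here. For a fixed number of step neurons the network output is piecewise
constant, so a residual error always remains.

```python
def step(s):
    """All-or-nothing activation."""
    return 1.0 if s > 0.0 else 0.0


def staircase_network(target, n_hidden):
    """Build a 1-input network of step neurons approximating target on [0, 1]."""
    knots = [k / n_hidden for k in range(n_hidden + 1)]
    # Hidden neuron k has weight 1 and threshold -knots[k]: it fires when x > knots[k].
    # Its output weight is the increment of the target across the k-th interval.
    out_weights = [target(knots[k]) - target(knots[k - 1])
                   for k in range(1, n_hidden + 1)]
    bias = target(knots[0])

    def network(x):
        hidden = [step(x - knots[k]) for k in range(1, n_hidden + 1)]
        return bias + sum(w * h for w, h in zip(out_weights, hidden))

    return network


def max_error(target, network, n_points=1001):
    """Largest absolute deviation on a uniform grid over [0, 1]."""
    xs = [i / (n_points - 1) for i in range(n_points)]
    return max(abs(target(x) - network(x)) for x in xs)


def square(x):
    return x * x


coarse = max_error(square, staircase_network(square, 10))
fine = max_error(square, staircase_network(square, 100))
```

In this sketch the error shrinks only when more step neurons are added, whereas
for any given network it stays finite.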




Neural networks are not the only model-free estimators. For example, polynomial
expansion can also be used as a generic model-free map of inputs to outputs.
However, it is much easier to “overfit” a specific data
sample with polynomials than with neural networks. Neural networks are
usually stable with respect to changes in the parameter space. This is achieved
through the nonlinear neuron activation functions and the distribution of learned
information over many neurons. In fact, one can remove several neurons from a
neural network and its performance will decrease only slightly. In contrast, deleting
several coefficients in a polynomial expansion is likely to change the performance
drastically.

4.1.3  Neural  Networks  as  Dynamical  Systems 

A “trained” network has “good” values for the thresholds and the synaptic
weights of the neurons. Let $P = T_h \times W$ be the parameter space of a neural
network where the spaces $T_h$ and $W$ represent its thresholds and synaptic weights.
A training algorithm is a map

$$T : P \to P \qquad (4.4)$$

The repeated application of the map $T$ on the parameter space defines a nonlinear
dynamical system. For example, $T$ could be the gradient descent algorithm.
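
The view of training as the iteration of a map on the parameter space can be
sketched in Python. The example is deliberately minimal and rests on assumptions
made here: a single neuron with identity activation, a mean squared error as the
performance measure, a fixed learning rate and a synthetic data sample generated
from $y = 1 + 2x$. It is a toy model, not one of the training algorithms of
Section 4.3.

```python
def make_gradient_descent_map(samples, learning_rate):
    """Return the map T: P -> P for one identity-activation neuron y = t + w * x."""
    n = len(samples)

    def T(p):
        t, w = p
        # Gradient of the mean squared error with respect to t and w.
        grad_t = sum(2.0 * (t + w * x - y) for x, y in samples) / n
        grad_w = sum(2.0 * (t + w * x - y) * x for x, y in samples) / n
        return (t - learning_rate * grad_t, w - learning_rate * grad_w)

    return T


def iterate(T, p0, n_steps):
    """Apply the map T repeatedly, starting from the initial condition p0."""
    p = p0
    for _ in range(n_steps):
        p = T(p)
    return p


# Synthetic (hypothetical) training sample generated from y = 1 + 2 * x.
samples = [(x, 1.0 + 2.0 * x) for x in (-1.0, 0.0, 1.0, 2.0)]

T = make_gradient_descent_map(samples, learning_rate=0.1)
p_final = iterate(T, p0=(0.0, 0.0), n_steps=2000)
```

For this convex toy problem the iteration settles on a point that $T$ maps to
itself, i.e. a fixed point of the dynamical system; for a real network neither
the existence nor the quality of such a point is guaranteed.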

The properties of the dynamical system depend on the network topology, on the
neuron activation function, on the input domain $X$, on the performance measure
applied to the output domain $Y$, on the training algorithm $T$ and on the initial
conditions $p_0 \in P$. In general, it is very difficult to investigate such complex
systems rigorously. As a consequence, one cannot be certain what the convergence
properties of the algorithm $T$ are.

Let $A$ be the attractor of the dynamical system $T$. The attractor of any
dynamical system reflects its asymptotic behavior and is defined as follows:

$$A = \lim_{n \to \infty} T^n P \qquad (4.5)$$

The attractor $A \subset P$ can be a fixed point, a periodic orbit or chaotic (aperiodic). In
our applications, chaotic attractors and periodic orbits are not useful since we are
interested in finding one point $p_{best} \in P$ that maximizes the performance
measure of the neural network. Fixed point attractors do provide a point, but
even then it is not clear that this point is the best solution. In Section 4.3 we will
investigate several algorithms and discuss some of their properties.

It  should  be  noted  that  in  certain  applications  chaotic  attractors  are  useful. 
This  is  generally  the  case  when  neural  networks  are  used  to  model  a time  dependent 
process.  For  example,  it  has  been  suggested  based  on  dynamical  neural  models  that 
rabbits  process  odor  information  with  chaotic  attractors[10].  In  our  applications, 
the  input  data  does  not  depend  on  time  and  we  are  only  interested  in  training 
algorithms  with  fixed  point  attractors. 

4.1.4  Neural  Networks  as  Ostensive  Learning  Systems 

Neural networks are not the only approach to machine intelligence. In fact,
the field of artificial intelligence preceded neural networks by several decades. In
that field, the basic objective is to build expert systems. Expert systems process
symbols, not numbers. One builds a network of nodes which process incoming
symbolic information based on fixed (expert) rules. The output is also symbolic.
The major disadvantage of this approach is the requirement to first define the
patterns that will be processed. However, this is not always possible or feasible.
In our applications of neural networks, for example, we use numerical data
corresponding to collider events. If we could define all the patterns in the input data,
the problem of signal enhancement would be almost solved. Unfortunately, this is
not feasible and the usage of expert systems is severely limited (if possible at all).

Neural networks, like living organisms, learn ostensively. They do not require
definition of the patterns that they learn. This property is called recognition
without definition [11]. It is perhaps the single most important feature of neural
networks. In fact, it can be used as a test of whether some specific models are
missing important information embedded in the input data. For example, in Chapters
6 and 9 we will compare the performance of Fisher discriminants, linear cuts and
neural networks. Sometimes, the neural networks perform no better than (say)
linear cuts. In that case, we can argue that linear cuts are sufficient as a model.
In other cases, neural networks perform better. We can then argue that there is
information (that we cannot define) which the neural networks recognized.

4.2  Structure  of  Neural  Networks 

When applying neural networks to real problems, one must first address the
issue of what the structure of the neural network should be. That choice should
reflect the nature of the input data as well as the algorithm that will train the
network. In general, neural networks differ in their topology, in the neuronal
activation function and in the overall performance measure.

4.2.1  Network  Topology 

The  topology  of  a neural  network  has  two  aspects: 

1.  Topology  based  on  physical  closeness  between  the  individual  neurons. 

2.  Topology  based  on  the  synaptic  connections  between  neurons. 

In  biological  neural  networks,  both  topologies  are  present.  Due  to  the  physical 
limitations  imposed  by  3 spatial  dimensions,  neurons  that  are  closer  to  each  other 
are more likely to be connected by synapses. In a software model of neural
networks, we are not restricted by this physical limitation. Therefore, we can connect
any  two  neurons  since  we  do  not  have  the  notion  of  physical  closeness.  It  should 
be  noted,  however,  that  if  one  desires  to  implement  a neural  network  in  hardware, 
the  network  would  be  subject  to  the  same  physical  limitations  as  real  biological 
networks.  Thus,  an  arbitrary  topology  based  on  synaptic  connections  only  is,  in 
general,  not  suitable  for  hardware. 

In  our  applications,  we  only  use  software  models  of  neural  networks  and  for  that 
reason  we  completely  ignore  topological  issues  related  to  physical  closeness.  Our 
topology  is  purely  based  on  synaptic  connections.  In  other  words,  the  topology 
of  a neural  network  is  completely  determined  by  the  specific  synaptic  connections 
between  the  individual  neurons. 

The most general neural network would have synaptic connections between any
two neurons. The most general network with 2 neurons, 2 inputs and 1 output
is shown in Fig. 4.6. All inputs are received by all neurons. The output from
each neuron goes to the input of every neuron, including itself. The single
output is taken from one of the neurons. Due to the complete symmetry, it does
not matter which neuron is used for the final output.

Figure 4.6: The most general neural network with 2 inputs, 2 neurons and 1 output.
The inputs go to every neuron. The output from each neuron goes to all other
neurons, including the input of the neuron producing the signal. The single output
is taken arbitrarily from one of the neurons.

As  can  be  seen,  this  neural  network  has  4 feedback  loops.  Such  neural  networks 
are  called  feedback  neural  networks.  If  a neural  network  has  no  feedback  loops, 
it  is  called  a feedforward  neural  network.  Fig.  4.7  shows  the  feedforward  network 
derived  from  Fig.  4.6  by  removing  all  loops.  Note  that  this  network  has  no  layers 
since  all  inputs  go  to  every  neuron.  In  other  words,  we  do  not  have  any  (topological) 
distinction  between  sensory  neurons,  interneurons  and  motor  neurons. 
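
Since the topology is determined entirely by the synaptic connections, it can be
stored as a connection table, and the distinction between feedback and
feedforward networks becomes a question of whether that table contains a loop.
The Python sketch below illustrates this; the function name `has_feedback`, the
table layout and the particular connection kept in the feedforward case are
assumptions made for the example.

```python
def has_feedback(connections):
    """Return True if the directed neuron-to-neuron graph contains a loop.

    connections[i][j] is True when the output of neuron i feeds neuron j.
    """
    n = len(connections)
    unvisited, in_progress, done = 0, 1, 2
    state = [unvisited] * n

    def visit(i):
        state[i] = in_progress
        for j in range(n):
            if connections[i][j]:
                # Reaching a neuron that is still being explored closes a loop.
                if state[j] == in_progress:
                    return True
                if state[j] == unvisited and visit(j):
                    return True
        state[i] = done
        return False

    return any(state[i] == unvisited and visit(i) for i in range(n))


# Two neurons, every output connected to every neuron including itself.
fully_connected = [[True, True],
                   [True, True]]

# The same two neurons with all loops removed: neuron 0 feeds neuron 1 only.
feedforward = [[False, True],
               [False, False]]

fully_connected_has_loop = has_feedback(fully_connected)
feedforward_has_loop = has_feedback(feedforward)
```

Only neuron-to-neuron connections enter the table; the external inputs, which
go to every neuron in both cases, cannot create a loop and are left out.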

The  choice  of  feedback  or  feedforward  neural  networks  should  be  dictated  by 
the  nature  of  the  input  data.  If  the  input  data  is  a time  series,  then  the  most 
natural  choice  is  to  use  a feedback  neural  network.  It  should  be  pointed  out  that 
the  feedback  loops  define  implicitly  the  notion  of  time  in  a neural  network.  The 
various  time  scales  are  represented  by  the  “size”  of  the  loops  and  they  coexist 


85 


Input  1 
2 


Figure  4.7:  The  most  general  feedforward  neural  network  with  2 inputs,  2 neurons 
and  1 output.  All  loops  from  Fig.  4.6  have  been  removed.  The  inputs  still  connect 
to  every  neuron  and  thus  there  are  no  well  defined  layers. 

together  at  all  times.  For  that  reason,  the  usage  of  a feedback  neural  network 
when  the  input  is  a time  series  is  the  most  consistent  choice  since  the  objective  is 
to discover patterns in time^7.

In  our  applications,  the  input  data  is  not  a time  series  and  for  that  reason  we 
use  feedforward  neural  networks  only.  This  is  analogous  to  image  recognition,  for 
example,  since  images  are  time-independent.  Even  in  biological  neural  networks, 
the  initial  stages  of  image  recognition  are  feedforward  which  is  not  very  surprising. 
It  should  be  noted,  however,  that  feedback  neural  networks  could  be  useful  in 

certain  high  energy  physics  problems.  For  example,  certain  detectors  can  track  in 

^7 Unfortunately, many authors use feedforward neural networks even when the input data is
a time series. This is especially true in financial applications where neural networks are used to
improve the performance of a portfolio of securities. The notion of time is given to the network
by simultaneously giving values corresponding to different times. Such an approach has numerous
disadvantages.  One  of  them  is  the  significant  increase  of  the  dimensionality  of  the  input  space. 
This  is  known  as  the  dimension  problem.  Its  essence  is  that  when  the  dimension  of  the  input 
space  is  large,  the  available  data  may  not  be  statistically  significant  since  the  input  space  would 
be  populated  very  sparsely.  Another  problem  with  “faking”  the  time  dependence  by  using  more 
inputs  is  the  (arbitrary)  pre-determination  of  the  time  scales  that  enter  the  network.  A feedback 
neural  network  would  automatically  consider  all  time  scales  at  the  same  time  and  determine  the 
time  patterns  and  the  time  scales  simultaneously.  The  most  likely  reason  for  such  inconsistent 
applications  is  the  difficulty  associated  with  the  training  of  a feedback  neural  network. 




real  time  the  motion  of  a particle.  One  could  utilize  feedback  neural  networks  to 
identify  the  signature  of  a particle  based  on  the  information  of  one  or  more  tracks. 

There  is  a subset  of  feedforward  neural  networks  which  have  well  defined  layers. 
For  example,  the  network  shown  in  Fig.  4.5  is  a layered  feedforward  neural  network 
with  one  hidden  layer.  Another  example  is  shown  in  Fig.  4.8  which  was  derived 




Figure  4.8:  The  most  general  layered  feedforward  neural  network  with  2 inputs, 
2 neurons  and  1 output.  The  inputs  to  the  lower  neuron  in  Fig.  4.7  have  been 
removed.  The  upper  neuron  is  the  input  layer  and  the  lower  neuron  is  the  output 
layer. 

from  Fig.  4.7  by  connecting  the  inputs  only  to  the  upper  neuron. 

As was discussed in the previous section, layering is very common in a nervous
system. It is not clear whether this is purely a reflection of the physical
3-dimensional constraints or whether there is an actual benefit of layering. In our
opinion, this is an open question that requires further investigation. Layered
feedforward neural networks are the most popular networks, but this is primarily due
to the fact that they are easy to train with local algorithms. As we will show in
Chapter  5,  genetic  algorithms  can  be  used  to  train  any  neural  network  regardless  of 
its  topology  and,  therefore,  the  limitation  inherent  in  local  algorithms  is  removed. 

In summary, one should address the following topological issues when using neural networks:


1.  Does  the  input  data  depend  on  time  or  not?  If  it  does,  feedback  topology  is 
likely  to  produce  better  results.  If  it  does  not,  feedforward  topology  is  likely 
to  be  better. 

2.  Is  this  neural  network  going  to  be  implemented  in  hardware  at  a later  time? 
If  the  answer  is  yes,  a physical  (3-dimensional)  topology  should  be  imposed 
as  well. 

3.  In  the  case  of  feedforward  neural  networks,  should  one  use  layers  or  not? 
This  is  an  open  question. 

4.  What  algorithms  will  train  the  neural  network?  Some  algorithms  can  only 
be  applied  to  specific  topologies  (say  feedforward  layered  neural  networks) 
and  thus  they  can  affect  the  choice  of  topology  in  a potentially  detrimental 
fashion. 

4.2.2  Neuronal  Activation 

The choice of neuronal activation function is another important factor in the
construction of neural networks. As was pointed out earlier, the activation of
neuron cells is an all-or-nothing event. Mathematically, the activation function of
a neuron cell is A(x) = θ(x). In artificial neurons, however, the activation function
is rarely a step function. The most widely used function is the sigmoidal activation
function S(x) defined as

\[ S(x) = \frac{1}{1 + e^{-x/\tau}} \tag{4.6} \]




Figure 4.9: Sigmoidal function S(x) for temperatures τ = 2 (dotted curve) and
τ = 1 (solid curve). At lower temperatures, S(x) is closer to a step function.

where τ is a parameter. For small values of τ, S(x) approaches θ(x). The parameter
τ can be viewed as “temperature” since S(x) resembles the energy distribution of
fermions at a given temperature τ. At zero temperature, S(x)|_{τ=0} = θ(x).

Fig. 4.9 shows the shape of S(x) for two different values of the temperature
τ. The dotted curve is for τ = 2 and the solid curve for τ = 1.
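
The effect of the temperature can be checked numerically with a minimal sketch. It assumes that the sigmoidal function has the form S(x) = 1 / (1 + exp(-x/τ)); the function names `step` and `sigmoid`, the value assigned to the step function at x = 0, and the sample points are illustrative choices.

```python
import math


def step(x):
    """Step activation theta(x); the value at x = 0 is a convention (assumed 0)."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x, tau=1.0):
    """Sigmoidal activation S(x) = 1 / (1 + exp(-x / tau)), with tau > 0."""
    z = x / tau
    # Use the algebraically equivalent form that never exponentiates
    # a large positive number, so small tau does not overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


points = (-1.0, -0.1, 0.1, 1.0)
for tau in (2.0, 1.0, 0.01):
    print("tau =", tau, [round(sigmoid(x, tau), 4) for x in points])
print("step    ", [step(x) for x in points])
```

As τ decreases, the printed values move toward the 0/1 pattern of the step function, which is the behavior shown in Fig. 4.9.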

The most important reason for using the sigmoidal function S(x) instead of
the simpler step function θ(x) is the desire to use local training algorithms. Local
algorithms require that the derivative of the activation function exists almost
everywhere and that it is not zero. Furthermore, the error back-propagation algorithm
requires an analytic form for the derivative of the activation function. The
sigmoidal function satisfies both conditions while at the same time it is a reasonable
approximation of the activation in neuron cells. It should be noted, however,
that for sufficiently low values of τ the network becomes very difficult to train. The
arbitrariness in determining the “right” value of τ is a disadvantage.




Another disadvantage of using S(x) (or any other activation function that is not
the step function) is the difficulty of hardware implementation. The θ function is
very simple to implement in hardware and the implementation would be very fast
as well. In contrast, the implementation of S(x) in hardware would be very costly
in terms of both the number of components and computation time. Even though
we are not using hardware implementations of our neural networks, in our opinion
such implementations will become more common in the future. For example, one
could train a neural network with the objective of removing background events
without sacrificing too much efficiency, and then implement it in hardware. This
hardware could then act as a first or second level on-line filter^8 of the trillions
of events being detected. Since speed would be of the essence, θ(x) would be the
best activation function.

There are other activation functions, but they all have about the same properties.
The most important observation is that S(x) is necessary for local training
algorithms. In Chapter 5 we will show that genetic algorithms are a viable alternative
to local algorithms. They can train neural networks equally well using the
step activation function.

4.2.3  Performance  Measures 

The  set  of  all  thresholds  and  synaptic  weights  of  a neural  network,  which  is  the 
parameter  space  P,  uniquely  determines  the  output  of  the  network  for  each  input 
stimulus.  To  be  able  to  train  a neural  network,  one  needs  to  define  a performance 
measure  which  depends  on  the  output  responses.  Even  though  the  performance 

^8 In hadron colliders, there are usually so many events per second that it is not even possible
to  store  all  of  them  on  (say)  tape.  Thus  it  is  necessary  to  discard  some  of  the  events  which  are 
presumed  to  be  background  with  a very  high  probability. 




measure  is  not  part  of  the  structure  of  a neural  network,  it  can  affect  the  choice 
of  topology  and  neuronal  activation  indirectly,  since  certain  performance  measures 
cannot be used with local algorithms and since local training algorithms, as was
already discussed, cannot be applied to an arbitrary neural network.

In  real  biological  neural  networks,  the  performance  of  the  nervous  system  is 
evaluated  continuously  for  each  input  stimulus.  Very  often,  this  is  achieved  by 
activating  the  pleasure-pain  center  in  the  brain  which  either  decreases  the  synaptic 
weights  of  the  neurons  that  participated  in  the  response  (in  the  case  of  pain)  or 
increases  them  (in  the  case  of  pleasure).  There  are  performance  measures  that 
simulate  this  process.  However,  it  is  beyond  the  scope  of  this  work  to  even  review 
them.  A thorough  and  rigorous  treatment  can  be  found  in  [11]. 

In our applications, we are not interested in the performance of a neural network
for a specific input-output pair. Our performance measures are defined on an entire
set of input-output pairs. Let us denote the input stimulus space as X and the
output (response) space as Y. In our applications Y ⊂ R since we are interested in
whether an event is signal or background and we can code it with a single number
in the range [0, 1], for example. Let x_i be the ith input stimulus where x_i ∈ X.
The space X corresponds to variables that specify an event. Let y_i be the response
from the network when presented with the input x_i, i.e.,

\[ y_i = N(x_i) \tag{4.7} \]

where N is the overall neural network map from X to Y. Let d_i ∈ Y be the desired
response. In other words, an ideal network would always respond with d_i when
presented with the input x_i. In our applications, the desired response d_i is:


\[ d_i = \begin{cases} 1 & \text{if } x_i \text{ corresponds to a signal event} \\ 0 & \text{if } x_i \text{ corresponds to a background event} \end{cases} \tag{4.8} \]


Since  a neural  network  will  not  be  able  to  always  respond  correctly  (except  in  some 
very  simple  and  idealistic  cases),  any  global  performance  measure  for  a sample  of 
input stimuli will have to be a function of y_i and d_i. The simplest scalar function
can be defined as follows: Let

\[ E_i = (y_i - d_i)^2 \tag{4.9} \]

be the instantaneous quadratic “error” between the actual output y_i = N(x_i) and
the desired value d_i. The total mean-squared error E can then be defined as

\begin{align}
E &= \langle E_i \rangle \tag{4.10} \\
  &= \langle (N(x_i) - d_i)^2 \rangle \tag{4.11}
\end{align}

where the angle brackets denote the linear mathematical expectation operator.
However, in most practical cases we do not know the joint probability density
distribution p(x, d) that defines the expectation operator ⟨·⟩. If we assume uniform
distribution, then we can rewrite the equation above as

\[ E = \frac{1}{N_s} \sum_{i=1}^{N_s} (N(x_i) - d_i)^2 \tag{4.12} \]




where N_s is the size of the sample of signal and background events. It must be
stressed  that  the  assumption  of  uniform  distribution  is  in  general  not  justified. 
We  will  return  to  this  problem  when  we  discuss  training  algorithms  that  use  E as 
performance  measure. 

The mean square error has another property that is sometimes not desirable.
Because each deviation is squared, it punishes large individual misclassifications
disproportionately. One can instead use the average of the absolute values as a
measure, which is defined as


\[ \frac{1}{N_s} \sum_{i=1}^{N_s} |y_i - d_i| \tag{4.13} \]


assuming,  once  again,  uniform  distribution.  In  practice,  there  is  little  difference 
between  these  two  measures  for  most  applications. 
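
The two sample-based measures can be compared on a small example. This is a minimal sketch under the uniform-distribution assumption stated above; the function names and the four-event sample are hypothetical and are not data from this work.

```python
def mean_squared_error(outputs, desired):
    """Sample mean of the squared errors (y_i - d_i)^2, as in Eq. 4.12."""
    return sum((y - d) ** 2 for y, d in zip(outputs, desired)) / len(outputs)


def mean_absolute_error(outputs, desired):
    """Sample mean of the absolute errors |y_i - d_i|, as in Eq. 4.13."""
    return sum(abs(y - d) for y, d in zip(outputs, desired)) / len(outputs)


# Hypothetical sample: two signal events (d = 1) and two background events (d = 0).
desired = [1, 1, 0, 0]
outputs = [0.9, 0.6, 0.1, 0.8]  # the last event is badly misclassified

mse = mean_squared_error(outputs, desired)
mae = mean_absolute_error(outputs, desired)

# Fraction of each measure contributed by the single worst event.
n = len(outputs)
share_sq = max((y - d) ** 2 for y, d in zip(outputs, desired)) / (n * mse)
share_abs = max(abs(y - d) for y, d in zip(outputs, desired)) / (n * mae)

print("mean squared error :", mse, " worst-event share:", round(share_sq, 3))
print("mean absolute error:", mae, " worst-event share:", round(share_abs, 3))
```

In this sample the single badly misclassified event accounts for a larger fraction of the squared-error measure than of the absolute-value measure, which is the sense in which the former punishes individual misclassifications more heavily.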

The mean-square error E has another disadvantage. Misclassifications of signal
events have the same contribution as misclassifications of background events. Yet
in many cases one is interested in the statistical significance measure for signal
enhancement, R_f, defined as

\[ R_f = \frac{N_{sig}}{\sqrt{N_{bck}}} \tag{4.14} \]

where N_sig is the number of signal events recognized by the network and N_bck is
the number of background events that were not distinguishable from signal events.
The statistical significance measure R_f is more sensitive to misclassification of
signal than misclassification of background. Whenever possible, it makes sense to
use R_f instead of the mean-square error E.

There is one very significant difference between E and R_f. The measure E can
be used with local training algorithms since the derivative of E with respect to the
output of the neural network exists and is not zero, i.e.

\[ \frac{\partial E}{\partial y_k} \neq 0 \tag{4.15} \]

almost everywhere. On the other hand, the derivative of R_f with respect to the
network output is zero almost everywhere and, as a consequence, the measure R_f
cannot be used in conjunction with local training algorithms.

To see that this is the case, let us first consider the outputs, y_i, of the neural
network. Each y_i ∈ [0, 1]. Let c ∈ [0, 1] be a cut on the network response. We will
only retain events x_i for which

\[ y_i = N(x_i) > c \tag{4.16} \]

Events for which y_i < c will be discarded since they are most likely background
events. Without any loss of generality we can assume that c = 0.5. The number
of signal events that survive condition 4.16 is given by

\[ N_{sig} = \sum_{i=1}^{N_s} \theta(y_i - c)\, d_i \tag{4.17} \]

and the number of background events that survive the cut is given by

\[ N_{bck} = \sum_{i=1}^{N_s} \theta(y_i - c)(1 - d_i) \tag{4.18} \]


In that case, the performance measure R_f can be written as

\[ R_f = \frac{\sum_{i=1}^{N_s} \theta(y_i - c)\, d_i}{\sqrt{\sum_{i=1}^{N_s} \theta(y_i - c)(1 - d_i)}} \tag{4.19} \]

Due to the θ functions in 4.19, ∂R_f/∂y_k = 0 almost everywhere for all k =
1, 2, ..., N_s and, as a result, we cannot use local training algorithms with R_f. For
that reason, we will use genetic algorithms whenever we use R_f as performance
measure.
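
The vanishing derivative can be seen directly in a small numerical sketch. It assumes R_f = N_sig / sqrt(N_bck) with the counts taken after the cut y_i > c; the function name `significance`, the six-event sample, and the convention used when no background event survives the cut are illustrative assumptions.

```python
import math


def significance(outputs, desired, cut=0.5):
    """R_f = N_sig / sqrt(N_bck) for events passing the cut y_i > cut."""
    n_sig = sum(1 for y, d in zip(outputs, desired) if y > cut and d == 1)
    n_bck = sum(1 for y, d in zip(outputs, desired) if y > cut and d == 0)
    if n_bck == 0:
        # Convention assumed for this sketch only; the text does not specify it.
        return math.inf if n_sig > 0 else 0.0
    return n_sig / math.sqrt(n_bck)


# Hypothetical sample: three signal events followed by three background events.
desired = [1, 1, 1, 0, 0, 0]
outputs = [0.9, 0.7, 0.4, 0.8, 0.3, 0.1]

r_before = significance(outputs, desired)

# Nudge every output slightly; no event crosses the cut, so the counts,
# and therefore R_f, do not change.
nudged = [y + 1e-3 for y in outputs]
r_after = significance(nudged, desired)

print("R_f before:", r_before)
print("R_f after :", r_after)
print("finite-difference slope:", (r_after - r_before) / 1e-3)
```

R_f changes only when an output crosses the cut, and then it jumps. A gradient-based update therefore receives no information about the direction in which to change the parameters.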

The mean-square error performance measure E is the most popular measure.
This is most likely due to its simplicity and applicability to local training algorithms.
There is a serious disadvantage, however, when the limitations of a training
algorithm influence the performance measure to be used. The performance
measure should reflect the objective for training of a neural network. It depends
on the problem being solved and reflects what we want this neural network to do.
Application of a different measure can potentially decrease the performance when
the results are evaluated with the true measure that is associated with the problem.
In addition, the mean-square error requires the joint probability distribution
p(x, d), which we either do not know or cannot practically infer from
the data sample. In the next section, we will come back to these issues when we
discuss local training algorithms.

4.3  Training  Algorithms 

Neural networks encode learned information in their synaptic weights and
thresholds. Training is the mechanism by which the synaptic weights and thresholds
change so that the performance of the neural network improves. In general,
performance  reflects  the  notion  of  correct  associations  between  input  stimuli  and 
output  responses. 

The state of a neural network is completely determined by the set of values of
its synaptic weights and thresholds. As before, we will call that set the parameter
space and denote it with P. The state of the network, therefore, is a point p ∈ P.
A training algorithm T defines a dynamical system on P and the process of training
is a trajectory of consecutive states in the parameter space P. The objective of a
training algorithm is to find a subspace A ⊂ P for which the performance of the
neural network is maximized. The subspace A is the attractor of the dynamical
system T : P → P.

Unless  a training  algorithm  has  a global  component,  there  is  no  guarantee  that 
A maximizes  the  performance  of  the  neural  network.  Even  if  there  is  a global 
component  in  T,  the  convergence  to  A may  be  impractically  slow.  As  a result,  one 
should  always  first  address  the  issue  of  what  a training  algorithm  converges  to  and 
then  address  the  issue  of  speed  of  convergence  - in  that  order. 

In biological neural networks, especially in complex organisms such as ourselves,
the state of the network is determined by three factors:

1.  The  genetic  makeup  of  an  individual.  This  determines  the  overall  structure 
of  the  nervous  system  and  the  mechanisms  for  its  development. 

2.  The  conditions  for  early  development  of  the  nervous  system.  This  factor  can 
promote  or  suppress  the  creation  of  synapses. 




3.  Subsequent  learning  based  on  individual  experience.  This  is  a continuous 
process  that  constantly  changes  the  synaptic  weights,  i.e.  encodes  new  infor- 
mation. 

The first two factors correspond to the initial structure of an artificial neural
network. The third factor corresponds to a training algorithm^9. In other words, if we
think of an artificial neural network as a model of the nervous system of humans,
then the third factor is the development of a child past the age of a few years. The
first two factors determine the structure of the nervous system which, in artificial
networks, is a static entity since we impose it ourselves^10.

In  living  organisms,  training  can  be  represented  schematically  as  shown  in 
Fig.  4.10.  Each  input  stimulus  is  “evaluated”  by  the  nervous  system  and  an 


Figure 4.10: Schematic representation of the training of a living organism. An
input stimulus results in an output response. That response is “evaluated” by
the  physical  environment  and  results  in  pleasure  or  pain.  Pleasure  increases  the 
synaptic  weights  that  participated  in  the  response  and  pain  decreases  them. 


output  response  is  generated.  This  response  is  then  implicitly  “evaluated”  by  the 

physical environment and the nervous system receives feedback in the form of

^9 This statement is not entirely accurate in the context of genetic algorithms. In the next
chapter we will, in fact, model the genetic makeup of an individual to represent a neural network.

^10 Once again, genetic algorithms can be used to dynamically discover the “optimal” structure
of a neural network. What we are describing here are the traditional approaches to training.




a new  stimulus.  This  feedback  can  induce  pleasure  or  pain.  Due  to  the  internal 
electro-chemical  structure  of  the  brain,  the  synaptic  weights  that  participated  in 
the  response  are  increased  (in  the  case  of  pleasure)  or  decreased  (in  the  case  of 
pain).  In  the  first  case,  the  input-output  association  is  strengthened.  In  the  case 
of  pain,  the  input-output  association  is  weakened  which  makes  it  less  likely  for  the 
individual  to  generate  the  same  response  in  the  future  if  presented  with  a similar 
input  stimulus. 

A model  of  this  process  is  shown  in  Fig.  4.11.  The  analog  of  the  nervous  system 


Figure  4.11:  Schematic  representation  of  the  training  of  a neural  network.  The 
nervous  system  in  Fig.  4.10  is  modeled  by  a neural  network.  The  physical  en- 
vironment corresponds  to  a performance  measure.  The  pleasure/pain  center  is 
represented  by  a training  algorithm. 

is  a neural  network.  The  physical  environment  corresponds  to  the  performance 
measure  of  the  problem  being  solved.  The  pleasure/pain  center  corresponds  to  a 
training  algorithm  T which  changes  the  parameter  space  of  the  neural  network. 

As  can  be  seen  from  these  two  figures,  the  driving  force  for  learning  is  the 
evaluation  of  the  response.  In  the  case  of  living  organisms,  this  evaluation  is 
performed  passively  by  the  physical  environment.  In  the  case  of  neural  networks, 
it  reflects  the  objective  that  we  want  to  achieve  with  a neural  network. 




4.3.1  Supervised  and  Unsupervised  Learning 

In the neural networks literature, training algorithms are divided into two classes:
supervised training algorithms and unsupervised training algorithms. The basic
motivation for such a classification stems from pattern-recognition theory [11] where
the distinction depends on whether or not pattern-class information for the input
stimuli is available. More precisely, let X be the input-stimulus space and Y be
the output-response space. Then if it is not known in advance what the correct
response for each input stimulus x_i ∈ X should be, the training algorithm is said to
be unsupervised. If, however, it is known in advance that for each input stimulus
x_i ∈ X, the correct (or desired) response is d_i ∈ Y, then the algorithm is said to
be supervised. In a sense, knowledge of the desired responses d_i allows supervision
of the training process by using some measure of performance that depends on the
actual responses y_i ∈ Y and the desired responses d_i as discussed in the previous
section.

This  classification  of  training  algorithms  implicitly  assumes  that  the  perfor- 
mance measure  is  always  part  of  a training  algorithm.  As  was  pointed  out  in  the 
previous  section,  the  choice  of  a training  algorithm  is  not  always  compatible  with 
a specific  performance  measure.  In  the  next  chapter,  however,  we  will  see  that 
genetic  algorithms  can  be  used  with  any  performance  measure.  Therefore,  the 
attribute  supervised  (or  unsupervised)  is  not  necessarily  meaningful  when  applied 
to  a training  algorithm. 

Unsupervised  training  algorithms  also  have  a performance  measure.  It  cannot 
be  otherwise,  since  a neural  network  cannot  read  our  minds  to  determine  what 
to  learn.  What  a neural  network  can  learn  depends  on  the  performance  measure. 




How  it  learns  it  (if  at  all)  depends  on  the  training  algorithm.  The  real  difference 
between  “supervision”  and  the  lack  of  it  is  in  the  performance  measure  itself.  If 
the desired responses d_i are not known, then the network estimates the unknown
probability density function p(x). No assumptions are made about p(x) except,
possibly, for some implicit restrictions inherent in the performance measure and/or
the training algorithm itself. If, on the other hand, the desired responses are
known, they constitute a priori knowledge of the distribution of input patterns
p(x). In that case, the neural network estimates the unknown function f : X →
Y by utilizing the a priori distribution p(x). This utilization is reflected in the
performance measure itself as discussed in the previous section.

In  living  organisms,  learning  is  both  supervised  and  unsupervised.  In  the  first 
case,  this  is  learning  when  we  know  what  the  response  should  be  and  we  are  trying 
to  master  it.  Skills,  generally,  fall  under  this  category.  In  the  second  case,  we  do 
not  know  in  advance  the  correct  response,  but  the  physical  environment  provides 
an  evaluation  measure  in  the  form  of  pain  or  pleasure  that  guides  the  learning 
process  in  the  “right”  direction. 

It is generally believed that learning in the brain is unsupervised [12] on the
grounds that the information that is physically accessible to a neuron is necessarily
local. However, the information processing in any nervous system is distributed
in  time  and,  consequently,  each  neuron  receives  instantaneously  information  that 
originated  at  different  places.  Therefore,  each  neuron  has  access  to  non-local  in- 
formation as  well.  One  such  case  is  the  pleasure/pain  center  in  the  brain  that 
“supervises”  the  input-output  map  of  the  nervous  system. 




In  our  applications  of  neural  networks,  we  always  know  the  desired  responses 
di.  For  that  reason,  it  is  beyond  the  scope  of  this  work  to  present  in  detail  unsu- 
pervised learning.  We  disagree,  however,  with  the  division  of  training  algorithms 
as  supervised  and  unsupervised.  We  believe  that  a more  appropriate  terminology 
would  be  supervised  learning  and  unsupervised  learning,  since  it  would  reflect  the 
fact  that  supervision  (or  the  lack  of  it)  is  not  always  a property  of  the  training 
algorithm. 

4.3.2  The  Gradient  Descent  Algorithm 

The  most  popular  algorithms  for  training  of  neural  networks  are  variations  of 
the  gradient  descent  algorithm.  The  neural  networks  literature  is  dominated  by 
such  training  algorithms.  Unfortunately,  the  severe  restrictions  that  these  algo- 
rithms impose  on  the  neural  network  structure,  the  neuronal  activation  functions 
as  well  as  the  performance  measure  are  generally  not  stated  clearly,  if  at  all.  The 
restrictions  on  the  performance  measure  are  particularly  disturbing  since  one  can 
be  forced  to  train  a neural  network  by  using  a performance  measure  that  does  not 
reflect  the  nature  of  the  problem. 

In its “classic” form, the gradient descent algorithm can be defined as follows:
Let p ∈ P be the current state of the neural network where P is the space of all
synaptic weights and thresholds. Let M : P → R be a real valued performance
measure of the neural network. Let G : P → P be the gradient descent algorithm.
Then G can be defined as

\[ G(p) = p - c \nabla M \tag{4.20} \]




where c ∈ R is a scalar parameter of the training algorithm. It may be fixed,
iteration  dependent,  dependent  on  the  absolute  value  of  the  gradient  |VM|,  etc. 
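
The map G of Eq. 4.20 can be sketched in a few lines. The example below is an illustration only: it uses a fixed step parameter c, estimates the gradient by central differences (an assumption made to keep the sketch generic; backpropagation uses analytic derivatives instead), and applies the map to a hypothetical quadratic measure `toy_measure` rather than to a neural network.

```python
def numerical_gradient(measure, p, h=1e-6):
    """Central-difference estimate of the gradient of measure at the point p."""
    grad = []
    for k in range(len(p)):
        up = list(p)
        up[k] += h
        down = list(p)
        down[k] -= h
        grad.append((measure(up) - measure(down)) / (2.0 * h))
    return grad


def gradient_descent_step(measure, p, c):
    """One application of the map G(p) = p - c * grad M(p)."""
    grad = numerical_gradient(measure, p)
    return [pk - c * gk for pk, gk in zip(p, grad)]


def toy_measure(p):
    """Hypothetical smooth measure with a single minimum at (1.0, -0.5)."""
    return (p[0] - 1.0) ** 2 + 2.0 * (p[1] + 0.5) ** 2


p = [0.0, 0.0]  # initial state in the parameter space
for _ in range(200):
    p = gradient_descent_step(toy_measure, p, c=0.1)

print("state after 200 iterations:", p)
print("measure at that state     :", toy_measure(p))
```

For this measure the trajectory of states contracts toward the single minimum. With several minima, the same map would converge to whichever local minimum attracts the initial state.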

The  gradient  descent  algorithm  reflects  the  notion  that  the  shortest  path  to 
a local minimum on the P × M hypersurface is to follow the gradient of that
hypersurface.  However,  there  are  a number  of  implicit  and  explicit  assumptions 
behind  any  version  of  a gradient  descent  algorithm.  They  are  as  follows: 

1.  The  neural  network  is  feedforward. 

2. The neuronal activation function (or functions) must all be piecewise differentiable
and their derivatives should be non-zero almost everywhere. This has
the undesirable properties of an arbitrary choice of activation functions, the
introduction of additional network parameters via the activation functions
(such as the τ parameter for the sigmoidal function, for example) as well as
increasing the computational complexity which renders such networks less
suitable for hardware implementations.

3. The performance measure must also be piecewise differentiable with a derivative
different from zero almost everywhere. This has the devastating consequence
of not being able to use certain performance measures even in principle
(such as the statistical significance measure R_f described in the previous
section).

4.  There  is  no  guarantee  that  the  algorithm  will  converge  even  to  a local  mini- 
mum, let  alone  to  the  global  minimum. 

In addition to these assumptions, there is another non-trivial assumption whenever
the performance measure M is expressed as an average. As discussed earlier,
the most popular performance measure is the mean square quadratic error E as
defined in Eq. 4.11. In that case, the measure is ill-defined since the joint probability
distribution p(x, d) between inputs and outputs is not known. In other words,
we simply do not know what ∇⟨(y_i - d_i)^2⟩ means since the linear expectation
operator ⟨·⟩ is not defined. The traditional approach is to define the gradient descent
algorithms as

\[ G(p) = p - \frac{c}{N_s} \nabla \sum_{i=1}^{N_s} (y_i - d_i)^2 \tag{4.21} \]

which implies, without any justification, a uniform joint probability distribution
p(x, d). From a dynamical point of view (i.e. the dynamics defined by the map
G on the parameter space P), this implies the substitution of the deterministic
vector ∇⟨(y_i - d_i)^2⟩ with the stochastic vector ∇ (1/N_s) Σ_{i=1}^{N_s} (y_i - d_i)^2 since the latter
is an instantaneous estimate of E based on the random outputs y_i.
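
The sample-average form of Eq. 4.21 can be made concrete for the smallest possible network. The sketch below assumes a single neuron with two inputs, two synaptic weights and one threshold, a sigmoidal activation with τ = 1, and output y = S(w1 x1 + w2 x2 - threshold). The four-event sample, the step parameter c and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Sigmoidal activation with temperature fixed at 1 (assumed)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(p, x):
    """Output of one neuron; p = [w1, w2, threshold] is the network state."""
    w1, w2, threshold = p
    return sigmoid(w1 * x[0] + w2 * x[1] - threshold)


def sample_mse(p, sample):
    """Sample mean of (y_i - d_i)^2 over (x_i, d_i) pairs."""
    return sum((neuron(p, x) - d) ** 2 for x, d in sample) / len(sample)


def sample_mse_gradient(p, sample):
    """Analytic gradient of sample_mse with respect to [w1, w2, threshold]."""
    grad = [0.0, 0.0, 0.0]
    for x, d in sample:
        y = neuron(p, x)
        common = 2.0 * (y - d) * y * (1.0 - y)  # chain rule, S'(z) = S(1 - S)
        grad[0] += common * x[0]
        grad[1] += common * x[1]
        grad[2] -= common  # the threshold enters the argument with a minus sign
    return [g / len(sample) for g in grad]


# Hypothetical sample: (input, desired response), signal = 1, background = 0.
sample = [((2.0, 1.5), 1), ((1.5, 2.5), 1), ((-1.0, -0.5), 0), ((-2.0, 0.5), 0)]

p = [0.0, 0.0, 0.0]
c = 0.5
initial_error = sample_mse(p, sample)
for _ in range(500):
    grad = sample_mse_gradient(p, sample)
    p = [pk - c * gk for pk, gk in zip(p, grad)]
final_error = sample_mse(p, sample)

print("initial error:", initial_error)
print("final error  :", final_error)
print("final state  :", p)
```

Every event in the sample enters the gradient with the same weight 1/N_s, which is exactly the uniform-distribution assumption discussed above.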

As  a consequence,  the  application  of  any  performance  measure  that  is  an  av- 
erage over  a set  of  actual  outputs  and  desired  outputs  results  in  two  additional 
problems: 

1.  The  assumption  of  uniform  joint  probability  distribution  between  inputs  and 
outputs  is  completely  unjustified  in  the  general  case.  This  can  result  in 
inferior  performance  of  a neural  network  on  this  ground  alone. 

2. The deterministic dynamical system G : P → P becomes a stochastic process
on  completely  artificial  grounds. 

Gradient descent training algorithms have three advantages: First, they are
easy to implement; second, given certain additional assumptions about the parameter
c and using E as performance measure (see [11]), it can be shown that the
attractor is a fixed point; third, they very quickly converge to the attractor. In our
opinion,  these  advantages  are  not  always  relevant  to  the  major  issues  involved  in 
the  construction  and  training  of  a neural  network.  We  believe  that  there  are  three 
important  issues  to  address  and  they  are  as  follows  (in  the  order  of  importance): 

1.  The  performance  measure  should  reflect  the  objective  of  the  problem  being 
solved  and  not  the  limitations  of  some  training  algorithm. 

2. One should attempt to use an algorithm that converges (at least probabilistically)
to an attractor A such that the best solution p_best ∈ A.

3.  The  convergence  to  A should  be  “as  fast”  as  possible. 

The third issue, as stated, is obviously vague. There are a number of mathematical
results for gradient descent training algorithms that define the third issue
precisely (at least in the context of the assumptions related to neuronal activation
functions, performance measures, etc.). However, since these results do not, in
general, apply to the first two criteria, their usage is limited. In the next chapter,
we will introduce a version of genetic algorithms that guarantees the usage of any
performance measure as well as convergence to the best solution p_best (at least in
principle). It does not, however, address the third issue in a mathematical fashion.
It is quite clear that in the future we will see more developments in the area of neural
network training. The current state, in our opinion, is still empirical in nature.
The rigorous results usually pertain to local training algorithms and there are so
many assumptions associated with them that the applicability of these results is
questionable or, at best, quite limited.




4.3.3  The  Backpropagation  Algorithm 

The backpropagation algorithm is an efficient realization of a gradient descent
algorithm. It can only be applied to feedforward neural networks. In addition to
the assumptions that any gradient descent algorithm must satisfy, which we presented
earlier, the backpropagation algorithm also implies the following additional
restrictions:

1. We must know the analytic form of the derivatives of all activation functions
used in the neural network.

2. The performance measure, $M(P)$, must be defined on an entire set of
stimulus-response pairs, i.e. it must be global.

3. The derivative of the performance measure with respect to the network outputs
must also be known analytically.

The sigmoidal activation function, $S(x)$, satisfies the first condition. The second
condition is satisfied by both the mean-square error function $E$ and the
statistical significance measure for signal enhancement $R_f$. The third condition
is satisfied by $E$ but not by $R_f$, as was shown in the previous section. As a
consequence, a feedforward neural network using sigmoidal activation functions
and the mean-square error function, $E$, as a performance measure can be trained
with the backpropagation algorithm.

Historical  Background 

There are many misconceptions surrounding the backpropagation algorithm.
After it was first popularized in [13], it was believed that the backpropagation
algorithm was the long-awaited breakthrough in machine intelligence. This belief
was heralded by both the popular and technical press and the backpropagation
algorithm quickly became “the training algorithm”. After this first wave of euphoria,
serious researchers started criticizing various aspects of the algorithm, in
particular the non-existent properties that it was believed to have. At present,
it is completely clear that the backpropagation algorithm is nothing but an
efficient gradient descent algorithm. Unfortunately, there are still many misconceptions
that persist in the neural networks literature. For example, many authors
(such as in [14]) still refer to feedforward neural networks as “backpropagation”
networks. They are confusing a specific network architecture with a particular way
of training. An excellent summary of the history of the backpropagation algorithm
can be found in [11], pp. 196-199.

Another  common  misconception  is  that  one  must  use  the  mean-square  error 
E as  performance  measure  when  applying  the  backpropagation  algorithm.  In  the 
literature  it  is  referred  to  as  the  error  backpropagation  algorithm  despite  the  fact 
that  the  backpropagation  algorithm  supports  any  performance  measure  satisfying 
the  last  two  conditions  listed  above,  not  just  E.  Such  terminology  is,  at  best, 
confusing.  It  would  be  like  saying  “sigmoidal  error  back  propagation  algorithm” 
whenever  we  use  sigmoidal  activation  functions  and  E. 

Derivation  of  the  Backpropagation  Algorithm 

Let $M$ be any global performance measure defined on a sample set of stimulus-response
pairs such that the derivative of $M$ with respect to the network outputs
exists, is analytic and is not zero almost everywhere. Let $P$ be the space of
network parameters comprised of all thresholds and synaptic weights. Since the
backpropagation algorithm is a special case of a gradient descent algorithm, we
can define the backpropagation algorithm in the same fashion, i.e.

$$B(p) = p - c\,\nabla_p M \qquad (4.22)$$

where $B$ denotes the backpropagation algorithm, $c$ is some (possibly time-dependent)
scalar and $p \in P$. The essence of the backpropagation algorithm is the
representation of $\nabla_p M$ in analytic form by utilizing the structure of the space $P$
that is induced by the feedforward network topology. Without any loss of generality,
we restrict our attention to layered feedforward neural networks since it is
easier to express the formulas in a regular fashion.
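
The update rule of Eq. 4.22 can be illustrated with a minimal Python sketch. The toy measure, the function names and the constant value of c below are assumptions made purely for illustration; the sketch only shows the repeated application of the map B to a parameter vector p.

```python
def descent_step(p, grad_m, c):
    """One application of the map B(p) = p - c * grad M(p) from Eq. 4.22."""
    g = grad_m(p)
    return [p_k - c * g_k for p_k, g_k in zip(p, g)]


def toy_grad(p):
    # Gradient of the hypothetical toy measure M(p) = sum_k (p_k - 1)^2,
    # whose only minimum is at p_k = 1 for every k.
    return [2.0 * (p_k - 1.0) for p_k in p]


def iterate(p, grad_m, c, steps):
    """Apply B repeatedly, starting from p."""
    for _ in range(steps):
        p = descent_step(p, grad_m, c)
    return p


p_final = iterate([0.0, 3.0], toy_grad, c=0.1, steps=200)
```

For this convex toy measure the iteration settles on the single fixed point; the measures of interest in this chapter are not this well behaved.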

Let $L$ be the number of layers in the neural network. Let $N_l$ denote the number
of neurons in layer $l$. Let $N_0$ denote the number of inputs to the network. In
essence, $N_0$ can be thought of as the number of neurons in an imaginary layer 0
that consists of “neurons” with one input, one output and the identity function
relating input to output. Let $p_{l,n,i}$ be the synaptic weight corresponding to input
$i$ of neuron $n$ in layer $l$, where $l = 1, \ldots, L$, $n = 1, \ldots, N_l$ and $i = 1, \ldots, N_{l-1}$.
Furthermore, let $p_{l,n,0}$ be the threshold value of neuron $n$ in layer $l$. In essence, we
are representing the threshold value $t$ of a neuron as a synaptic weight associated
with the constant input 1. In other words, instead of writing the action of a neuron
as

$$y = A\Big(t + \sum_{i=1}^{n} w_i x_i\Big) \qquad (4.23)$$

where $y$ is the output of the neuron, $w_i$ are its synaptic weights, $x_i$ are the inputs,
$t$ is the threshold and $n$ is the number of inputs, we rewrite this equation as

$$y = A\Big(\sum_{i=0}^{n} p_i x_i\Big) \qquad (4.24)$$

with the convention that $x_0 = 1$ for all neurons and that

$$p_i = \begin{cases} t & \text{for } i = 0 \\ w_i & \text{for } i = 1, \ldots, n \end{cases} \qquad (4.25)$$


This  convention  simplifies  the  formulas  since  we  do  not  have  to  treat  the  threshold 
values  separately.  Figure  4.12  illustrates  such  “removal”  of  the  threshold  value. 


Figure 4.12: Shows the representation of the threshold value, $t$, as a synaptic
weight, $w_0$, corresponding to a constant input value of $x_0 = 1$.
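
The equivalence of Eqs. 4.23 and 4.24 can be checked numerically. The following is a minimal sketch; the sigmoidal activation, the function names and the numerical values are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def neuron_with_threshold(t, w, x, activation=sigmoid):
    # Eq. 4.23: y = A(t + sum_{i=1..n} w_i x_i)
    return activation(t + sum(w_i * x_i for w_i, x_i in zip(w, x)))


def neuron_with_folded_threshold(p, x, activation=sigmoid):
    # Eq. 4.24: y = A(sum_{i=0..n} p_i x_i), with x_0 = 1 and p_0 = t
    x_ext = [1.0] + list(x)
    return activation(sum(p_i * x_i for p_i, x_i in zip(p, x_ext)))


t, w, x = 0.3, [0.5, -1.2], [2.0, 0.7]
y_a = neuron_with_threshold(t, w, x)
y_b = neuron_with_folded_threshold([t] + w, x)   # p = (t, w_1, ..., w_n), Eq. 4.25
```

Both forms give the same output up to floating-point rounding, so the threshold needs no separate treatment in the formulas that follow.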

Let $A_{l,n}(x)$ be the activation function of neuron $n$ in layer $l$ and let us denote
its derivative by $A'_{l,n}(x)$. Then the gradient of the performance measure $M$ with
respect to the synaptic weights and thresholds in the last layer is given by

$$\frac{\partial M}{\partial p_{L,n,i}} = \frac{\partial M}{\partial y_{L,n}} \, \frac{\partial y_{L,n}}{\partial p_{L,n,i}} \qquad (4.26)$$

where $y_{l,n}$ denotes the output from neuron $n$ in layer $l$. Using Eq. 4.24, the second
term in the expression becomes

$$\frac{\partial y_{L,n}}{\partial p_{L,n,i}} = A'_{L,n}\Big(\sum_{j=0}^{N_{L-1}} p_{L,n,j} \, x_{L,n,j}\Big) \, x_{L,n,i} \qquad (4.27)$$

where $x_{l,n,i}$ denotes the $i$th input of neuron $n$ in layer $l$. Since the inputs to a
neuron are the outputs from the previous layer (this is the place where we use the
layered feedforward structure), then

$$x_{l,n,i} = y_{l-1,i} \qquad (4.28)$$

for all $l = 1, \ldots, L$. As a consequence, combining the last three equations yields

$$\frac{\partial M}{\partial p_{L,n,i}} = \frac{\partial M}{\partial y_{L,n}} \, A'_{L,n}\Big(\sum_{j=0}^{N_{L-1}} p_{L,n,j} \, y_{L-1,j}\Big) \, y_{L-1,i} \qquad (4.29)$$


for the gradient of $M$ with respect to the network parameters $p_{L,n,i}$ in the last
layer $L$. This gradient depends on the outputs from the previous layers which,
in their turn, depend on the outputs of the layer preceding them. This recursive
dependence of the gradient with respect to parameters of one layer on the gradient
with respect to parameters of the next layer allows a backward “propagation”
of the computation from the last layer to the first layer, hence the name of the
algorithm. More specifically, the gradient with respect to the network parameters
in layer $L - 1$ is given by

$$\frac{\partial M}{\partial p_{L-1,n,i}} = \sum_{m=1}^{N_L} \frac{\partial M}{\partial y_{L,m}} \, \frac{\partial y_{L,m}}{\partial y_{L-1,n}} \, A'_{L-1,n}\Big(\sum_{j=0}^{N_{L-2}} p_{L-1,n,j} \, y_{L-2,j}\Big) \, y_{L-2,i} \qquad (4.30)$$

where the sum over $m$ is over all neurons in the last layer. Furthermore, since

$$\frac{\partial y_{l,m}}{\partial y_{l-1,n}} = A'_{l,m}\Big(\sum_{j=0}^{N_{l-1}} p_{l,m,j} \, y_{l-1,j}\Big) \, p_{l,m,n} \qquad (4.31)$$

for any $l = 1, \ldots, L$, then Eq. 4.30 can be rewritten as

$$\frac{\partial M}{\partial p_{L-1,n,i}} = \sum_{m=1}^{N_L} \frac{\partial M}{\partial y_{L,m}} \, A'_{L,m}\Big(\sum_{j=0}^{N_{L-1}} p_{L,m,j} \, y_{L-1,j}\Big) \, p_{L,m,n} \, A'_{L-1,n}\Big(\sum_{j=0}^{N_{L-2}} p_{L-1,n,j} \, y_{L-2,j}\Big) \, y_{L-2,i}. \qquad (4.32)$$

The important feature of this equation is that the terms pertaining to the last layer
are the same as in Eq. 4.29. Therefore, starting the computation of the gradient from
the parameters in the last layer and working “backwards” implies that we can reuse
the calculations when computing the gradient with respect to parameters in the
previous layers. This reuse of calculations is what makes the backpropagation
algorithm more efficient than a general purpose gradient descent algorithm. Eqs.
4.29 and 4.32 define the backpropagation algorithm recursively since each layer
depends on the next layer. Due to the recursion, a closed form expression cannot
be given unless the number of layers, $L$, is fixed. Even in that case, a closed form
expression would be so long that its value would be severely diminished.
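
The backward recursion can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not a definitive implementation: all neurons use the same sigmoidal activation, the measure is evaluated on a single stimulus-response pair, its derivative with respect to the outputs is supplied by the caller as the hypothetical function `dM_dy`, and the squared error used in the check is only an example. The index convention follows the text, with index 0 of every neuron holding the threshold.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_prime(s):
    y = sigmoid(s)
    return y * (1.0 - y)


def forward(params, x):
    """Return (sums, outputs) per layer; outputs[l][0] == 1 is the constant input."""
    outputs = [[1.0] + list(x)]              # imaginary layer 0
    sums = [None]
    for layer in params:                     # layer[n][i] plays the role of p_{l,n,i}
        prev = outputs[-1]
        s = [sum(p * y for p, y in zip(neuron, prev)) for neuron in layer]
        sums.append(s)
        outputs.append([1.0] + [sigmoid(v) for v in s])
    return sums, outputs


def backprop_gradient(params, x, dM_dy):
    """Gradient of M with respect to every p_{l,n,i}, from the last layer backwards."""
    sums, outputs = forward(params, x)
    L = len(params)
    grads = [None] * L
    dM_dout = dM_dy(outputs[L][1:])          # dM/dy_{L,n}
    for l in range(L, 0, -1):
        layer = params[l - 1]
        # delta[n] = dM/dy_{l,n} * A'(sum_j p_{l,n,j} y_{l-1,j}); reused by layer l-1
        delta = [g * sigmoid_prime(s) for g, s in zip(dM_dout, sums[l])]
        grads[l - 1] = [[d * y for y in outputs[l - 1]] for d in delta]
        # dM/dy_{l-1,n} = sum_m delta[m] * p_{l,m,n}  (index 0 is the constant input)
        n_prev = len(outputs[l - 1]) - 1
        dM_dout = [sum(delta[m] * layer[m][n] for m in range(len(layer)))
                   for n in range(1, n_prev + 1)]
    return grads


def measure(params, x, d):
    # Example measure on one pair: squared error between outputs and desired outputs.
    _, outputs = forward(params, x)
    return sum((y - t) ** 2 for y, t in zip(outputs[-1][1:], d))


params = [
    [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.6]],   # layer 1: 2 neurons, threshold + 2 inputs
    [[0.05, -0.7, 0.9]],                     # layer 2: 1 neuron, threshold + 2 inputs
]
x, d = [0.5, -1.0], [1.0]
grads = backprop_gradient(params, x,
                          lambda y: [2.0 * (yn - dn) for yn, dn in zip(y, d)])


def numerical_partial(l, n, i, eps=1e-6):
    """Central-difference estimate of dM/dp for one parameter, used as a check."""
    saved = params[l][n][i]
    params[l][n][i] = saved + eps
    up = measure(params, x, d)
    params[l][n][i] = saved - eps
    down = measure(params, x, d)
    params[l][n][i] = saved
    return (up - down) / (2 * eps)
```

The list `delta` is the quantity that is computed once per layer and reused by the layer before it. For a measure that is a sum over a sample set, the per-pair gradients would simply be added.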




In most practical cases, all neurons have the same activation function (see the
footnote below). In that case, $A_{l,n}$ is simply replaced by $A$. This particular case,
however, does not result in a faster algorithm, just a simpler one to implement.


Convergence  Properties 

Since the backpropagation algorithm is just a special realization of a gradient
descent algorithm, its convergence properties are subject to the same restrictions
that gradient descent algorithms have. In particular, there is no guarantee (quite
the contrary) that the backpropagation algorithm will ever converge to a global
minimum. In fact, one cannot even guarantee that its attractor, $A_B$, is a fixed
point. In the most general case, one should be prepared to deal with an aperiodic
(chaotic) attractor. Even if $A_B$ is periodic, a large period would make it almost
indistinguishable from a chaotic attractor.

Footnote: One example of justified usage of different activation functions would be a problem for which
the output-response space, Y, is, for example, unbounded. In that case, a sigmoidal function
would not necessarily be appropriate. On the other hand, the unbounded Y can always be
represented by a bounded neural network response followed by a non-decreasing function that
maps the network response to unbounded values. That function could be, for example, part of
the performance measure. In addition, as already discussed, biological neural networks always
produce bounded signals and use the same activation function. It is probably best to use the
same activation function for all neurons and perform any rescalings outside of the neural network.
This applies to input data as well. There is no compelling reason to burden the neural network
with such trivial issues as rescaling of input or output data.

In most applications, it is desirable that $A_B$ be a fixed point. The scalar c in
Eq. 4.22 is the single most important quantity that governs the convergence of
the backpropagation algorithm. It has the meaning of a rate of change. If it is “too
small”, the convergence is slow, but the algorithm is more likely to (eventually) converge to a
local minimum. If it is “too large”, the algorithm is more likely to converge to a periodic or
chaotic attractor. If c changes with time (i.e. with each iteration of the algorithm),
then, assuming certain functional forms of c(t), one can show that the attractor is a
fixed point (for more details, see [11], for example). Even in those cases, however,
the convergence to the fixed point is achieved only in the limit of a countably infinite
number of iterations which, in practice, is not very meaningful.
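
The role of the scalar c can be seen on a one-parameter toy problem. The sketch below is an illustration only: the measure M(p) = -cos(p), the starting point and the two values of c are assumptions chosen so that the behaviour is easy to verify. It iterates the map of Eq. 4.22 with a small and a large constant c.

```python
import math


def iterate(c, p0=1.0, steps=500):
    """Iterate B(p) = p - c * dM/dp for the toy measure M(p) = -cos(p)."""
    orbit = [p0]
    for _ in range(steps):
        p = orbit[-1]
        orbit.append(p - c * math.sin(p))    # dM/dp = sin(p)
    return orbit


small = iterate(c=0.5)    # settles on the fixed point p = 0, the minimum of M
large = iterate(c=2.2)    # settles on a period-2 orbit: p, -p, p, -p, ...
```

With c = 0.5 the minimum at p = 0 is an attracting fixed point. With c = 2.2 the same minimum becomes unstable and the iteration ends up alternating between two points on either side of it, which is the simplest example of a periodic attractor.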

Another factor that can influence the convergence is the choice of activation
functions. In the case of sigmoidal functions, which are the most popular choice,
low “temperatures”, r, imply slow convergence and vice versa. At the same time,
higher values of r diminish the capability of a neural network to generalize. To see
that this is the case, let us consider a very high value of r, say 1,000,000. In that
case, the sigmoidal function would resemble a linear function and, consequently,
the neural network would be incapable of learning a nonlinear map $X \to Y$, since
the network would be composed of linear functions only and, thus, the network
map $N : X \to Y$ would be linear as well. This dilemma has no resolution in
the context of local algorithms, in general, and the backpropagation algorithm,
in particular. In our neural networks, we use temperature r = 1, but we could
have equally well used r = 0.985421, for example. Some training algorithms
make the parameter r a time-dependent variable. The rationale is that at the
initial stages of training speed is more important than robustness and that at later
stages of training, when a reasonably “good” solution is already found, robustness
is more important than speed of convergence. This approach is very dangerous
since during the initial stages of training the neural network may not be capable,
even in principle, of finding a satisfactory solution due to the lack of robustness
implied by high temperatures. As a consequence, any subsequent improvement
in robustness may not be sufficient since the network is already forced to a local
minimum during the stage of “speed”. In addition, the choice of the functional
form of r(t) is completely arbitrary and outside the realm of both the network
topology and the training algorithm.
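
The loss of nonlinearity at high temperature can be checked numerically. The sketch below assumes the common form S(x) = 1/(1 + exp(-x/r)) for the sigmoid with temperature r; this form, the function names and the input range are assumptions made for illustration. It compares the sigmoid with its tangent line at x = 0.

```python
import math


def sigmoid(x, r=1.0):
    # Assumed form of the sigmoidal function with "temperature" r.
    return 1.0 / (1.0 + math.exp(-x / r))


def max_deviation_from_line(r, xs):
    # Largest distance between S(x) and its tangent line at x = 0: 1/2 + x / (4 r).
    return max(abs(sigmoid(x, r) - (0.5 + x / (4.0 * r))) for x in xs)


xs = [k / 10.0 for k in range(-100, 101)]        # inputs in [-10, 10]
dev_hot = max_deviation_from_line(1_000_000.0, xs)
dev_unit = max_deviation_from_line(1.0, xs)
```

On this range the sigmoid with r = 1,000,000 is indistinguishable from a straight line to within rounding error, whereas for r = 1 the deviation is of order one. A network built from the former can therefore only represent (nearly) linear maps.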

The most popular choice of performance measure is the mean-square quadratic
error, E, introduced in the previous section. As already discussed, this measure
cannot be defined properly since we do not know the joint probability distribution
p(x, d) between inputs and desired outputs. As a consequence, this measure is
approximated with the instantaneous mean-square error assuming a uniform distribution.
The backpropagation algorithm, therefore, becomes a stochastic dynamical
system, even though, by its original definition, it is deterministic. This constitutes
a further complication in the dynamics of the algorithm which is completely
artificial in nature.

Despite all the negative consequences associated with a local algorithm, the
backpropagation algorithm is a very practical local algorithm. It does not always
converge to a fixed point, but when it does, it converges quickly. The proper place
of this algorithm could be stated as follows: whenever (1) we can model our problem
with a feedforward neural network, (2) the natural performance measure for the
problem can be differentiated with respect to the network outputs, (3) there is no
significant problem with non-step activation functions, and (4) we are willing to take
the chance of missing the global minimum, then the backpropagation algorithm is our
best choice, since it is the most efficient known algorithm for training networks
that satisfy all of the conditions above.




4.3.4  Global  Algorithms 

The most important deficiency of local algorithms, including the backpropagation
algorithm, is the high probability of convergence to a local minimum,
assuming that they converge at all. When the dimension of the parameter space
is high, as is the case with neural networks, a single application of a local algorithm
is, in most cases, unsatisfactory since the local solution is usually far from
the global minimum. Applying a local algorithm to the same problem several times
is one alternative, since one can choose the best result from a set of “runs”.
Such an approach, however, is completely unsystematic and time consuming.

Another approach is to use a global algorithm. From the point of view of convergence
to the global minimum, the ideal global algorithm would be the enumeration
algorithm. Given a certain desired accuracy, one could compute the performance for
all possible points in the parameter space P and choose the best one. Unfortunately,
this algorithm is hopelessly impractical except for some extremely trivial problems.
For example, suppose we would like to train a relatively small neural network with
dim(P) = 100. Suppose that each network parameter is represented by 2 bytes,
i.e. 16 bits. Then the number of possible points $p \in P$ is $2^{1600} \approx 10^{482}$. Assuming
an extremely fast computer, each network computation would take at least 1 ps.
Therefore, a complete enumeration of the parameter space would require about
$10^{470}$ s. Since the age of the universe is approximately $3 \times 10^{17}$ s, to complete such a
computation we would require $10^{452}$ computers running in parallel for 30 billion
years.
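
The orders of magnitude in this estimate can be reproduced with a few lines of Python. The sketch works with base-10 logarithms to avoid overflow and assumes one picosecond per network evaluation and a 365.25-day year; the variable names are illustrative.

```python
import math

dim_p = 100                  # number of network parameters, dim(P)
bits_per_param = 16          # 2 bytes per parameter
seconds_per_eval = 1e-12     # assumed: 1 ps per network computation

total_bits = dim_p * bits_per_param
log10_points = total_bits * math.log10(2)                  # log10 of 2^1600
log10_seconds = log10_points + math.log10(seconds_per_eval)

seconds_in_30_gyr = 30e9 * 365.25 * 24 * 3600              # 30 billion years
log10_computers = log10_seconds - math.log10(seconds_in_30_gyr)

summary = (round(log10_points), round(log10_seconds), round(log10_computers))
```

Rounded to the nearest power of ten, `summary` holds the exponents of the number of points, of the enumeration time in seconds, and of the number of computers needed to finish within 30 billion years.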

Another  global  algorithm  is  the  random  algorithm.  We  select  each  point  p at 
random  and  remember  the  best  performer.  It  would  eventually  discover  the  global 




minimum,  but  it  would  take  about  as  long  as  the  enumeration  algorithm.  The 
random  algorithm,  however,  is  actually  useful  in  conjunction  with  local  algorithms. 
In  Chapters  6 and  7,  we  will  use  the  best  result  from  a random  algorithm  as  a 
starting  point  for  the  backpropagation  algorithm.  Of  course,  this  approach  does  not 
guarantee  the  best  solution.  In  practice,  however,  the  random  algorithm  provides  a 
“good”  start  for  the  backpropagation  algorithm  and  the  overall  results  are  better. 
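
The combination of a random algorithm with a local algorithm can be sketched as follows. The one-dimensional toy measure, the search interval, the number of random points and the step size are all assumptions made for illustration; plain gradient descent stands in for the local algorithm.

```python
import math
import random


def measure(p):
    # Toy performance measure with many local minima; the global minimum is at p = 0.
    return 0.05 * p * p - math.cos(2.0 * p)


def gradient(p):
    return 0.1 * p + 2.0 * math.sin(2.0 * p)


def random_search(n_points, low, high, rng):
    """Random algorithm: sample points uniformly and remember the best performer."""
    candidates = [rng.uniform(low, high) for _ in range(n_points)]
    return min(candidates, key=measure)


def gradient_descent(p, c=0.05, steps=2000):
    """Local algorithm: repeated application of p -> p - c * dM/dp."""
    for _ in range(steps):
        p = p - c * gradient(p)
    return p


rng = random.Random(0)            # fixed seed so that the run is reproducible
p_start = random_search(50, -10.0, 10.0, rng)
p_final = gradient_descent(p_start)
```

The local stage can only improve on the random start, but it refines whichever basin the start happens to lie in, so the global minimum is still not guaranteed.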

Simulated annealing is another way to combine global and local algorithms.
In this case, a gradient descent algorithm is combined with a certain amount of
Gaussian noise. The hope is that if the magnitude and the scale of the noise are
“correct”, one would be able to overcome a local minimum and force the gradient
descent to another, presumably better, local minimum. This approach is certainly
advantageous compared to using local algorithms alone. However, the Gaussian
noise turns any deterministic algorithm into a stochastic algorithm. In addition,
the choice of magnitude and scale for the Gaussian noise is a non-trivial problem
since the “shape” of the parameter space is normally not known.
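
A minimal sketch of gradient descent combined with Gaussian noise is given below. It follows the description in the text rather than any particular published annealing schedule; the toy measure, the initial noise scale `sigma0`, the decay factor and the starting point are assumptions made for illustration.

```python
import math
import random


def measure(p):
    # Toy performance measure with many local minima; the global minimum is at p = 0.
    return 0.05 * p * p - math.cos(2.0 * p)


def gradient(p):
    return 0.1 * p + 2.0 * math.sin(2.0 * p)


def noisy_descent(p, c=0.05, sigma0=1.0, decay=0.995, steps=3000, seed=0):
    """Gradient descent plus Gaussian noise whose scale shrinks with each iteration."""
    rng = random.Random(seed)
    best_p, best_m = p, measure(p)
    sigma = sigma0
    for _ in range(steps):
        p = p - c * gradient(p) + rng.gauss(0.0, sigma)
        m = measure(p)
        if m < best_m:               # remember the best point visited so far
            best_p, best_m = p, m
        sigma *= decay               # gradually reduce the noise scale
    return best_p, best_m


p0 = 6.0                             # starts in the basin of a local minimum
best_p, best_m = noisy_descent(p0)
```

The values of `sigma0` and `decay` are exactly the kind of choice that the text describes as non-trivial: if the noise is too small the iteration never leaves its starting basin, and if it is too large the descent component is drowned out.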

In the next chapter, we will present a version of genetic algorithms that is a
viable alternative to local algorithms. Unlike local algorithms, genetic algorithms
can have a global component that makes the convergence to the global minimum
possible. Unlike simulated annealing (or other similar global components to local
algorithms), the genetic algorithms can change the scale of the “noise” dynamically
without any presumptions about the parameter space P. In addition, genetic
algorithms are at the same time local algorithms whenever P is a topological space.
Since there is no notion of derivatives in genetic algorithms, they can always be
used with the natural performance measure for a problem, regardless of whether
it is differentiable or not. In particular, we will use the performance measure
$R_f$, which cannot be used in conjunction with the backpropagation algorithm, for
example. The last property alone implies that the domain of applicability of genetic
algorithms is larger than that of any local algorithm.


CHAPTER  5 

GENETIC  ALGORITHMS 


Genetic algorithms are a broad class of maximization algorithms modeled after
genetics and evolution [15, 16]. Unlike local algorithms, such as the gradient
descent algorithm, genetic algorithms are much less likely to find and stay in a
local maximum. This is a considerable advantage for a large class of problems,
including our particular applications described in the next chapter. At the same
time, genetic algorithms have local properties which make it possible to find and
refine the “optimal” solutions in a reasonable time without
precluding the possibility that there might be an even better solution.

The  genetic  algorithm  described  here  is  generic  in  the  sense  that  there  are  no 
parameters  that  one  has  to  adjust  in  order  to  achieve  a good  convergence.  The 
algorithm  adapts  itself  to  both  the  problem  being  solved  and  the  current  stage  of 
evolution. 

In the first two sections, we give an overview of genetics and microevolution
in order to establish the terminology as well as to gain an understanding of how
evolution works in nature. The latter is important to genetic algorithms since
they are a model of the evolution of a population. This overview is not complete.
We only emphasize the topics that have direct relevance to genetic algorithms.

In  the  subsequent  sections  we  investigate  in  detail  the  various  structural  and 
functional  components  of  our  genetic  algorithm.  Since  most  of  this  work  is  original, 
the  exposition  is  comprehensive. 




In  the  last  two  sections  we  give  two  applications  of  genetic  algorithms  - for 
performing  linear  cuts  and  for  training  neural  networks.  We  will  use  them  in  the 
next  chapter  when  we  show  how  to  enhance  the  signal  for  the  Higgs  boson  and  the 
top  quark. 


5.1  Basic  Concepts  in  Genetics 

Genetics originated with the paper by Gregor Mendel published in 1866 in the
Proceedings of the Society of Natural Sciences in Brno. In this paper he not only
established the basic laws of heredity, but also demonstrated that heredity can be
studied experimentally. It was not, however, until the beginning of the twentieth century
that other biologists understood the importance of his work.

Since that time, the advances in genetics have been continually accelerating. By
1910, the term gene was introduced to represent the distinct heritable traits passed
from parents to offspring. It was also understood that genes of higher organisms
are carried by chromosomes, structures within each cell of an organism. By 1930,
a number of methods for genetic analysis were developed, primarily by the group
of Thomas Morgan. These methods include the mapping of the relative gene
positions within a chromosome, which remains a standard method to this day. In
1940, Max Delbrück, a quantum chemist and theoretical nuclear physicist, founded
an informal multi-disciplinary group of physicists, biologists and chemists to study
the physical nature of the gene. By 1952, it was experimentally established that
genes in bacteriophages^1 are made of DNA (deoxyribonucleic acid). In 1953, James
Watson and Francis Crick discovered the double-helix structure of DNA
molecules, which further strengthened the conviction that genes in all^2 organisms
are made of DNA. By the 1970s, the advent of recombinant DNA technology^3 led
to direct experimental evidence that the genetic material is made of DNA [17].

^1 Bacteriophages are viruses that infect bacteria.

5.1.1  DNA  structure 

As  discussed  earlier,  it  is  now  firmly  established  that  the  physical  carrier  of  the 
genetic  information  of  most  living  organisms  is  made  of  DNA.  In  this  subsection 
we  give  an  overview  of  the  most  important  aspects  of  the  DNA  structure. 

DNA molecules are polymers of nucleotides. Although there are many different
nucleotides, only four of them participate in DNA molecules: 2'-deoxyadenosine
5'-triphosphate, 2'-deoxyguanosine 5'-triphosphate, 2'-deoxycytidine 5'-triphosphate
and 2'-deoxythymidine 5'-triphosphate. Their conventional abbreviations are A,
G, C and T. Thus each nucleotide of the DNA molecule represents a digit of a
number in base 4. It is interesting to note that the preferential status of these four
specific nucleotides is not a coincidence [18]. Their chemical properties are such
that A can pair with T and G can pair with C. This complementary property is
called base pairing and it makes it possible for DNA molecules to be replicated.
In the words of Watson and Crick: “It has not escaped our notice that the specific
pairing we have postulated immediately suggests a possible copying mechanism
for the genetic material.” Base pairing is also responsible for the double helix
structure of the DNA molecules.

^2 Strictly speaking, almost all organisms, since the genes of retroviruses are made of RNA.

^3 A technology that allows the creation of DNA molecules in test tubes by ligating together
pieces of non-contiguous DNA.




The “purpose” of each cell is to produce certain proteins, at the right time
and in the correct amount. Most genes carry information about the structure of a
specific protein. Proteins are also polymers, but their monomers are amino acids.
There are a total of 20 different amino acids. Since there are only four nucleotides,
a single base pair can distinguish at most 4 amino acids and two base pairs at most
4^2 = 16, so neither is sufficient to code for an amino acid. Three base pairs, with
4^3 = 64 combinations, is the minimum number of
nucleotides necessary for the coding of an amino acid. Not surprisingly, evolution
has arrived at the same conclusion: every three base pairs on the DNA code for
one amino acid. This sequence of three base pairs is called a codon. A gene consists
of a sequence of codons which carry the information about all the amino acids that
build a specific protein. The DNA molecule consists of a sequence of genes and
intergenic regions as shown in Fig. 5.1. The intergenic regions do not code for
proteins (or anything else)^4.
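
The counting argument for the codon length can be written out in a few lines of Python. The function name and its default arguments are illustrative assumptions; the sketch finds the smallest number k of base-4 digits for which 4^k is at least the number of amino acids.

```python
def min_codon_length(alphabet_size=4, n_amino_acids=20):
    """Smallest k such that alphabet_size ** k >= n_amino_acids."""
    k = 1
    while alphabet_size ** k < n_amino_acids:
        k += 1
    return k


codon_length = min_codon_length()
spare_codons = 4 ** codon_length - 20    # combinations left over after 20 amino acids
```

Three nucleotides give 64 combinations, which leaves room for several codons per amino acid.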


Figure 5.1: Shows a DNA molecule with its two polynucleotides. The polynucleotides
are directed and biological information is stored only in the 3' → 5'
direction. Thus each gene resides on one of the two DNA strands only. The strand
where a gene resides is called the template strand for that gene.


^4 It is interesting to note that most of the DNA in higher organisms consists of intergenic
regions or highly repetitive meaningless nucleotide sequences. For example, it is estimated that
the human genome contains only several percent of useful biological information. The total
information is about 6 GB, but only about 200 MB are actually useful. Bacteria, on the other
hand, utilize their DNA quite efficiently: usually more than 80% of the information is useful.

There are 3 general types of organisms: viruses, prokaryotes and eukaryotes.

Viruses only have a DNA^5 molecule encapsulated in a protein shell. They cannot
replicate by themselves since they lack the machinery for transcription and translation as
well as the necessary machinery for energy utilization.

^5 Some viruses, called retroviruses, use RNA molecules to carry their genetic information. The
AIDS virus is one such example. It is an open question in genetics whether retroviruses are
remnants from the first stages of the evolution of life. It is believed that during the first stages of
evolution, about 4 billion years ago, all genetic material was stored in RNA and that only later
DNA molecules were synthesized and used as storage. During that transitional period, there must
have been an abundance of various RNA molecules, some of which might have become viruses.
Retroviruses, therefore, could be distant descendants of these early stages of life formation.

Prokaryotes are always single-celled organisms which have one single circular
DNA molecule inside the cell and all the necessary ingredients for self-sustenance
and replication. Bacteria are prokaryotes.

Eukaryotes are usually multi-cellular organisms which have many linear DNA
molecules packaged in chromosomes. Chromosomes are DNA molecules with proteins
around them that, among other things, allow a very compact packaging of
the DNA molecules^6. In addition, most eukaryotes have diploid cells. This means
that they have twice the number of chromosomes, one set from each parent.
That also means that for each gene there are two DNA molecules
coding for it. Normally one of them is suppressed, the other is not. Yeast and man
are eukaryotes.

^6 It is interesting to note that chromosomes are packaged extremely well within the nucleus of
a cell. For example, the linear length of all DNA molecules in humans is about 1 m. Yet the
chromosomes are so well packed that the actual storage size is about 1 μm.

The genome of an organism is the entire genetic complement of its cells. This
includes both variations of a chromosome for diploid cells.


5.1.2 Gene Expression

The expression of genes is the process by which the information stored in the
DNA is utilized and one or more proteins are produced. In later sections we will
make the connection between biological gene expression and gene expression in the
context of genetic algorithms.

The expression of genes is a two-stage process. The gene is first transcribed
and then it is translated. Transcription refers to the process by which a
region  of  the  DNA  corresponding  to  one  or  more  genes  is  copied  to  ribonucleic 
acid  (RNA).  RNA  molecules  are  also  polymers  and  they  are  very  similar  to  DNA 
molecules.  Their  monomers  are  also  nucleotides,  and  3 of  them  (A,  G and  C)  are 
the  same  as  the  nucleotides  for  the  DNA.  The  fourth  nucleotide,  abbreviated  as  U, 
replaces  the  T nucleotide  in  DNA.  The  U nucleotide  in  RNA  can  pair  with  the  A 
nucleotide  in  DNA  and  thus  the  RNA  transcript  can  be  viewed  as  an  exact  copy 
of  the  information  that  was  originally  on  the  DNA  region. 

The  next  stage  of  gene  expression,  translation,  is  the  process  by  which  all  the 
genes  on  the  RNA  molecule  are  translated  into  proteins.  This  is  accomplished  by 
replacing  each  codon  within  a gene  with  the  amino  acid  it  codes  for.  The  amino 
acids  for  each  gene  are  chained  together  and  the  end  result  is  the  production  of 
the  proteins  these  genes  are  coding  for. 
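
The two stages described above can be sketched in a few lines of code. This is only an illustration, not part of the original exposition: the function names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding strand of the DNA (so the RNA is the same sequence with T replaced by U), and the codon table is a small subset of the standard genetic code.

```python
# Illustrative subset of the standard genetic code (RNA codon -> amino acid).
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala",
    "AAA": "Lys", "UAA": "STOP",
}

def transcribe(dna_coding_strand: str) -> str:
    """Copy a DNA coding strand to RNA: same sequence, with T replaced by U."""
    return dna_coding_strand.upper().replace("T", "U")

def translate(rna: str) -> list[str]:
    """Read the RNA one codon (three nucleotides) at a time until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

rna = transcribe("ATGTTTGGCGCTAAATAA")
print(rna)             # AUGUUUGGCGCUAAAUAA
print(translate(rna))  # ['Met', 'Phe', 'Gly', 'Ala', 'Lys']
```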

At first it may seem redundant that the DNA information is first copied to
RNA and only after that the actual translation to amino acids is performed. This,
however, is an absolutely necessary step due to the physical and chemical constraints
within the cell. If there were no transcription, the amino acids would have
to be physically close to the DNA molecule in order for them to polymerize into
a protein. That would mean that the cell would be able to produce very few proteins
at a time. In addition, regulation of when to produce what would probably
be impossible. Gene regulation is beyond the scope of this overview. It is the
least known aspect of cell functioning. What is known is that before each gene
(or cluster of genes) there are regions on the DNA, called promoters, that must
be unobscured by other molecules in order for transcription to even start. If these
regions are not free, no transcription of this gene (or cluster of genes) occurs.
Therefore, that gene is not translated.^7

As mentioned above, transcription to RNA involves not just one gene,
but, in general, a cluster of genes. Such clusters of genes are called operons. In
other words, an operon is one or more genes that share a common regulator. As a
consequence, either all of them are expressed at the same time, or none of them
is expressed. This is very significant since very often a specific function of the
cell requires several different proteins to be available at the same time in order
to achieve the end result. This is also significant in genetic algorithms, as will
become clear later.

5.1.3  Replication  of  DNA 

Life would be impossible without replication of genetic material. As will
be discussed in the next section, replication is the basis for evolution. From a
chemical point of view, the fact that the nucleotides in both DNA and RNA complement
each other makes replication possible. It is conceivable that if the strength
of the electromagnetic interaction were a little different, there might not be relatively
simple polymers whose monomers exhibit complementary properties. As a
consequence, life might not be possible even in principle!

^7 Despite all the advances in genetics, we are very far from understanding how living organisms
function because gene regulation is still poorly understood. Even after the human genome project
is complete and a complete map of (possibly) all genes is available, there is still much more work
to be done. In a sense, mapping the genes is the easy part. How they are regulated is the real
challenge in genetics.

Replication  of  DNA  serves  two  different  purposes.  For  single  cell  organisms, 
such  as  bacteria,  it  is  the  basis  for  the  creation  of  offspring.  The  offspring  of  most 
bacteria  are  almost  or  completely  identical  to  their  parents. 

In multi-cellular organisms, DNA replication is necessary in two very distinct
cases. The first is to make replicas of cells in order to replace dead cells or to achieve
growth. Such cell division is called mitosis and the cells themselves are called somatic cells.
The second type of cell division is called meiosis and it is the mechanism by which
sex cells are created.

In the case of mitosis, the DNA of the new cell is identical to that of the parent
cell. Since this process is not relevant to the genetic algorithms described later,
we omit the details. In the case of meiosis, the new cell has different DNA than
the parent cell. There are two processes that are responsible for the change in the
DNA molecules: crossover and mutation.

As stated earlier, eukaryotes are almost always diploid organisms. As a
consequence, the genome of each cell contains DNA from both parents. During
meiosis, the DNA of the produced cell becomes a combination of the DNA from
each parent. This is achieved by crossing over each pair of chromosomes that
originated from each parent. Fig. 5.2 shows the process of crossover during the
formation of a new sex cell. In this illustration, there are two crossover points.
As a consequence, the region of DNA in the middle is inherited from the father,
while the left and right regions come from the mother. The genes in this new DNA
become a mixture of the genes of the two parents.


[Figure 5.2 diagram: the mother DNA and the father DNA, and below them the resulting DNA in the sex cell.]

Figure 5.2: Shows the formation of a DNA molecule in sex cells. The new DNA is
a combination of the DNA from the two parents of the individual.
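
The two-point crossover in this illustration can be mimicked on plain strings. The sketch below is an assumption-laden toy, not the implementation used later in this work: the function name `two_point_crossover` is hypothetical, each parent is a string of equal length, and the two cut points are drawn uniformly at random.

```python
import random

def two_point_crossover(mother: str, father: str, rng: random.Random) -> str:
    """Return a child whose middle segment comes from the father and whose
    left and right segments come from the mother."""
    assert len(mother) == len(father)
    # Pick two distinct interior cut points, sorted so that left < right.
    left, right = sorted(rng.sample(range(1, len(mother)), 2))
    return mother[:left] + father[left:right] + mother[right:]

rng = random.Random(0)
child = two_point_crossover("MMMMMMMMMM", "FFFFFFFFFF", rng)
print(child)  # a run of M, then a run of F, then a run of M
```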


In addition to crossover, during meiosis some of the nucleotides of the new
DNA undergo a random change. The probability of such an event is very small. It
is, however, one of the most significant factors in evolution, as will become clear
in the next section. This random change is called mutation. We must distinguish
between mutation during meiosis and mutation due to adverse environmental factors.
In the first case, the mutation becomes part of a new child. Most mutations
are harmful or neutral^8, but a few are advantageous. The good mutations eventually
become part of the genetic pool of a species. Environmental mutations, on the
other hand, are generally not inheritable since they occur in somatic cells. They
are harmful to an individual without the potential of passing good mutations to
future generations.

^8 If a mutation changes even a single nucleotide in the DNA which happens to be at the place
where a gene resides, then the corresponding codon will be different. The code translating codons
into amino acids, however, is degenerate since there are 64 possible codons,
but only 20 amino acids. As a consequence, a mutation may be neutral in the sense that
the final outcome, the protein this gene is coding for, is the same. If the mutation changes one
amino acid, then the new protein will have different properties. The effect may be insignificant,
such as slightly different eye color, or catastrophic, if this protein is some very important enzyme.
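
The degeneracy of the code can be checked directly. The following sketch builds the standard genetic code from its usual compact one-letter listing (bases in the order U, C, A, G; `*` marks a stop codon) and tests whether a single-nucleotide change is neutral. The helper name `is_neutral` is hypothetical and introduced only for this illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order, one letter per amino acid, '*' = stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def is_neutral(codon: str, position: int, new_base: str) -> bool:
    """True if replacing one nucleotide leaves the coded amino acid unchanged."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return CODE[mutated] == CODE[codon]

print(len(CODE))                        # 64
print(len(set(CODE.values()) - {"*"}))  # 20
print(is_neutral("GGU", 2, "C"))        # True: GGU and GGC both code for glycine
print(is_neutral("GGU", 0, "A"))        # False: AGU codes for serine
```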




The same processes take place in the sex cells of the other parent. As a consequence,
a child has DNA molecules that are a mixture of the DNA from its
four grandparents plus any mutations that may have occurred during meiosis in the
parents of the child.

5.2  Basic  Concepts  in  Biological  Microevolution 

In  the  previous  section,  we  described  the  structure  of  DNA  as  well  as  the  major 
processes  involved  during  the  two  types  of  cell  division.  Microevolution  refers  to 
changes  on  the  scale  of  an  entire  population  of  individuals  over  many  generations. 
These  changes  are  a direct  consequence  of  the  underlying  cellular  mechanisms 
in  conjunction  with  the  environment  in  which  a population  resides.  In  essence, 
microevolution  is  the  dynamics  of  a population  which  is  induced  by  the  underlying 
genetic  processes  and  the  environment.  Macroevolution,  on  the  other  hand,  deals 
with  the  dynamics  of  a group  of  species  over  a very  long  time  frame,  including 
the  creation  of  new  species.  We  will  not  consider  macroevolution  since  the  genetic 
algorithms  described  later  do  not  (yet)  support  the  creation  of  new  species. 

A thorough  understanding  of  microevolution  is  absolutely  necessary  for  the 
successful  construction  and  application  of  genetic  algorithms.  At  the  same  time, 
certain  micro-evolutionary  processes  are  more  important  than  others  in  the  overall 
population  dynamics.  For  that  reason,  we  only  concentrate  on  the  processes  that 
are  most  significant  for  the  dynamics  of  a population. 

Excluding the case of identical twins and clones, every individual in a population
of higher organisms (multi-cellular eukaryotes reproducing sexually) is
unique^9. We will refer to the collection of the genomes of all individuals in a population
as the genetic pool of that population. Microevolution is the change of the
genetic pool over time, presumably for the better. It is very important to realize,
however, that “better” does not necessarily mean “better” for an individual. The
overall measure of performance is the ability of the population to survive, not the
ability of a specific individual to survive. In the words of Darwin: “Individuals do
not evolve, populations do”. This is an extremely important feature of microevolution
which is also one of the most important features of genetic algorithms.

For a single species population that is not too small^10, the most important
factors in the dynamics of the genetic pool are genetic variation, the environment
(natural selection) and the social behavior of the individuals. We proceed to investigate
these three factors in some detail since they are also critical for genetic
algorithms.


5.2.1  Sources  of  Genetic  Variation 

The most important factor in microevolution is the mechanism by which new
genetic information is created. Clearly, if such a mechanism did not exist, evolution
would have been impossible even in principle. As shown in the previous section,
mutation and crossover are the two processes at the molecular level that can result in
the change of genes^11. Of all the possible sources of genetic variation, mutation
is the most important one.

^9 It is estimated that there are approximately 10®°° possible gene combinations in humans.
Yet there are fewer than 10^10 humans alive today. Even accounting for all the humans that ever
lived, it is extremely unlikely that two humans will ever have the same genetic makeup.

^10 If the population is “too small”, significant aberrations in the dynamics occur, none of which
is good. For example, chance alone could result in the loss of valuable genetic traits which, in
turn, can result in the extinction of that population. The critical population size depends,
among other things, on genome size and the specifics of reproduction.

Mutation is the only factor that can create new genes. All other factors, including
crossover, can only make new combinations of what already exists in the
genetic pool. In viruses and most bacteria^12 mutation is the only source of genetic
variation. This is usually sufficient for these organisms for two major reasons:
First, the size of their genome is very small and, as a consequence, the number of
possible nucleotide combinations is fairly limited^13. Second, they can reproduce
very quickly and as a result undergo many generations of microevolution in a small
amount of time. The E. coli bacterium, for example, has about 2800 genes, the size
of its DNA is 1 MB and it can replicate itself in about 20 minutes. Humans, on
the other hand, have about 50,000 genes, the total DNA is 6 GB and it takes 6 to
8 hours^14 to produce a new cell. The reproduction time is about 20 years. If mutation
were the only factor in humans, the genetic changes would be so slow that
it would be impossible to adapt to new environmental conditions or to maintain
a reasonably diverse genetic pool. This observation has direct bearing on genetic
algorithms since this is one of the most distinguishing features of genetic algorithms
versus some other maximization algorithms with a global component. In summary,
mutation alone is sufficient for simpler organisms. For more complicated organisms,
it is critical that mutation exists, but it is not sufficient as a factor since it
is too slow.

^11 There are other mechanisms for genetic change on the molecular level, one of which involves the
transposition of DNA fragments from one place to another. They are, however, beyond the scope
of this overview.

^12 There are some bacteria that can reproduce sexually and, therefore, crossover applies to them
as well.

^13 There are some viruses that have as few as 3 genes only!

^14 It is interesting to note that when DNA is replicated in eukaryotes, the replication starts at
a huge number of places at the same time. If the replication were to start at one place only, it
would take about 20 years to copy the entire DNA in a sequential fashion!

Crossover is the second important source of genetic variation. It implies sexual
reproduction and most likely the predominance of sexual reproduction among
higher organisms is directly attributable to the advantages that crossover provides.
Crossover of DNA creates new combinations of genes, but it does not create
completely new genes. As a consequence, the new combination of genes is likely,
although by no means certain, to be harmless to the individual. At the same time,
the new combination can be better than the 4 genomes it originated from. Natural
selection would then favor this combination more than others and the new good
traits will propagate through future generations. Unlike mutation, crossover is not
likely to be harmful and it is much faster than mutation. In contrast, mutation is
likely to be harmful, but when it is not, it provides the population with brand new
genetic material that can be further refined via crossover. In a sense, crossover
bridges the gap between the need for fast genetic change and the need for preservation
of what is already good. A high level of mutation could satisfy the first
need, but it would be continuously destroying “good” genes. The other extreme is
the absence of mutation and the reliance on crossover alone. This would certainly
preserve the genes in the genetic pool, but there would be no possibility for radical
change, which would doom such a population to extinction since it would not be able
to respond to drastic changes in the environment.
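
The difference between the two sources of variation can be seen in a toy experiment. The sketch below is only an illustration under stated assumptions: genomes are short bit strings, `crossover` is a single-point crossover and `mutate` flips bits independently (both names and the rate 0.05 are hypothetical choices). No individual in the initial pool carries a 1 in the last position, so recombination alone can never produce it, while mutation can.

```python
import random

def crossover(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover of two equal-length bit strings."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome: str, rate: float, rng: random.Random) -> str:
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in genome
    )

rng = random.Random(1)
# Every individual carries a 0 in the last position: that allele is absent from the pool.
pool = ["10100", "01100", "11010", "00110"]

children = [crossover(*rng.sample(pool, 2), rng) for _ in range(1000)]
print(any(child[-1] == "1" for child in children))  # False: crossover only recombines

mutants = [mutate(child, 0.05, rng) for child in children]
print(sum(m[-1] == "1" for m in mutants))  # roughly 5% of the mutants now carry the new allele
```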




5.2.2  Natural  Selection 

Genetic variation alone cannot improve a population. There has to be a way
of determining which genetic traits are advantageous to the population. It has to
be emphasized that there is no absolute measure that would allow each individual
to be evaluated and then either discarded or maintained. Different genetic traits,
however, make individuals differ in their ability to cope with the current environmental
conditions. The ones that are more adaptable have a higher probability of
surviving and reproducing. This has the effect of an implicit differential measure for
the genome of each individual. In the long run, the frequency of the better traits
increases, while at the same time the worst traits have a high probability of being
removed entirely from the genetic pool of the population.

Natural  selection  is  the  differential  survival  and  reproduction  of  individuals  of  a 
population.  It  is  the  mechanism  by  which  genetic  variation  is  implicitly  evaluated 
in  the  context  of  the  current  environment.  It  is  important  to  realize  that  natural 
selection  is  not  an  active  agent.  Nature  does  not  purposefully  seek  the  “best” 
individuals  and  destroy  the  “worst”.  It  is  just  a measure  of  the  difference  in 
reproduction  and  survival.  This  difference  depends  on  the  current  environmental 
conditions  and,  as  a result,  this  differential  measure  changes  continuously  in  time. 
The  best  traits  last  year  may  no  longer  be  relevant  for  survival  this  year. 

The two most important factors in the environment that result in differential
selection are limited resources and the physical conditions of the environment.
Limited resources do not allow populations to grow indefinitely. This results in
competition among the members of a population. This competition decreases the
chances for survival of individuals whose traits are not as good as those of other
individuals. As a consequence, those “unfit” individuals have a smaller chance to
reproduce even in the absence of any other selection mechanism. In a sense, the
popular translation of natural selection as “survival of the fittest” is not entirely
accurate. It does not matter to evolution whether an individual survives or not.
What is important is whether an individual reproduces or not. For example, an
individual might have an excellent genome, but if for one reason or another this
individual cannot reproduce, it would be completely useless to the population: the
genetic information could not be utilized in future generations and thus it would not
have any effect on evolution. Greater chance of survival implies greater chance of
reproduction, but this is only one of the factors that affect reproduction. It would
be more accurate to express the essence of natural selection as “reproduction of the
fittest”. This would include all the factors, including sexual selection, for example,
that lead to a difference in the reproduction for the members of a population. In
genetic algorithms the resource limits are the amount of memory and CPU time
available to a population.
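
“Reproduction of the fittest” is a statement about differential reproduction rates, which is easy to mimic numerically. The sketch below uses fitness-proportional sampling purely as an illustration; it is one common selection scheme and is not claimed to be the one used in the algorithms described later. The function name `select_parents` and the fitness values are hypothetical.

```python
import random
from collections import Counter

def select_parents(population, fitness, n_parents, rng):
    """Draw parents with probability proportional to fitness (with replacement)."""
    weights = [fitness(individual) for individual in population]
    return rng.choices(population, weights=weights, k=n_parents)

rng = random.Random(42)
population = ["A", "B", "C", "D"]
fitness = {"A": 8.0, "B": 4.0, "C": 2.0, "D": 0.0}.get  # hypothetical fitness values

counts = Counter(select_parents(population, fitness, 10_000, rng))
print(counts.most_common())  # "A" is drawn most often; "D" (fitness 0) never reproduces
```

Note that individual “D” may well survive in the population; it simply never contributes to the next generation, which is the only thing that matters for the genetic pool.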

The physical conditions of the environment impose another restriction on populations.
They directly affect the probability for an individual to survive and,
consequently, its ability to reproduce. The death of “unfit” individuals not only
limits their survival and reproduction abilities, but also decreases the population
size, thus making more room for the “fit” individuals. In the genetic algorithms
described in the next section, the environment is represented by the input data
representing high energy physics processes.




5.2.3  Social  Behavior 

Social behavior of the individuals within a population also results in a differential
survival and reproduction rate for its different members. Strictly speaking, it
is a variation of natural selection. However, it only applies to more complex organisms.
Viruses, for example, do not socialize even though they too are collectively
restricted by both population size and the environmental conditions.

It  is  impossible  to  even  review  all  the  variations  that  exist  in  social  behavior. 
One  of  the  most  important  social  interactions,  however,  relates  to  reproduction. 
This  is  known  as  sexual  selection.  Sexual  selection  refers  to  the  interaction  between 
various  members  of  a population  that  results  in  the  selection  of  a mate.  Obviously, 
it  is  implied  that  such  species  reproduce  sexually. 

Sexual selection has the effect of differentiating the probability of individuals
to reproduce. In most cases, it does not affect their survival. But as was pointed
out earlier, what matters to evolution is who reproduces, not who survives. The
process of choosing a mate has the effect of a faster improvement of the genetic
pool than would have been possible otherwise. For example, the males of many
animal species compete for the right to reproduce. This competition singles out the
star performers of the male population, which reflects highly desirable genetic traits,
such as strength or leadership^15. If this selection were left to the environment alone,
it would take many more generations to achieve the same genetic improvement,
if at all. From the point of view of speed of genetic change, sexual selection
increases the effect of crossover. It is not surprising that more complex species
have more complex sexual behavior. Those species need a faster mechanism for
genetic improvement than could be achieved by mutation and crossover alone. In
addition, sexual selection provides a more specific measure of good traits which
the environment itself could not do. The individuals in the genetic algorithms
described in the next section reproduce sexually, which has a very positive effect
on their “convergence” properties.

^15 It is interesting to note that females usually do not compete as much. It most likely has to
do with genetic diversity. If each generation passed the best genes from both males and females
to the next one, very quickly the genetic pool would start to become more homogeneous. This
would endanger the survival of the entire population since it might not be able to adapt to new
conditions sufficiently fast.

5.3  Maximization  Problems  as  Biological  Populations 

In the previous two sections we reviewed the basic concepts from biology that
directly relate to the genetic algorithms described in this chapter. We now concentrate
on the construction of genetic algorithms and the representation of maximization
problems as populations. We continue to use biological terminology since
it would be both redundant and less clear to invent a new one.

The basic motivation for genetic algorithms as a computer representation of
the evolution of populations is the ability of nature to create extremely complex
organisms without having had the time to explore each possible combination. In
other words, life on earth could not have originated or evolved on the basis of pure
luck alone. There is an ongoing debate about the origins of life and one of its
aspects is the probability for the spontaneous creation of complex polymers. Opponents
of a natural origin of life point out (incorrectly) that since the probability
for the spontaneous creation of even single proteins is exceedingly small, life
could not have originated in a natural fashion[18]. From an algorithmic point of
view, the probability of finding a solution in a multi-dimensional space in a purely
random fashion is small. The higher the dimension, the less likely it is to happen.
However, microevolution and crossover (if present) do provide a mechanism for a
much faster discovery of “good” solutions. This mechanism implies the existence
of a copying mechanism of previous solutions, which is readily available in the form
of RNA and DNA. In addition, microevolution provides a mechanism for incremental
improvements while at the same time not precluding the possibility of brand
new information to enter a population. From the point of view of maximization
algorithms, genetic algorithms exhibit both global and local properties, exactly
the way natural evolution does.
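
The dependence of a purely random search on the dimension can be made concrete with a rough estimate. The setup is an assumption made only for this illustration: the search space is taken to be the unit hypercube $[0,1]^d$, points are drawn uniformly, and a “good” solution is any point inside a sub-cube of side $\epsilon$.

```latex
P_{\text{hit}} = \frac{\epsilon^{d}}{1^{d}} = \epsilon^{d},
\qquad
\epsilon = 0.1,\; d = 10
\;\Rightarrow\;
P_{\text{hit}} = 10^{-10},
\qquad
\langle N_{\text{trials}} \rangle = \frac{1}{P_{\text{hit}}} = 10^{10}.
```

Under these assumptions each additional dimension multiplies the expected number of random trials by $1/\epsilon$, which is why a mechanism that reuses and refines previous solutions matters.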

To represent maximization problems as populations of individuals that evolve,
we will model the functional aspects of genetics and microevolution. In other
words, we are not interested in modeling the physical structure of polymers, cells,
etc. We will only model the genetic and evolutionary processes that result in the
dynamics of a population which improves with each generation.

Let us consider the real-valued map $M$ defined as follows:

$$M(P) : D \to \mathbb{R} \tag{5.1}$$

Here $D$ represents the input data space and $P$ is the space of parameters for the
map $M$. In other words, $M$ is a real-valued parametric model for the data space
$D$. We are interested in finding the maximum value for $M$, i.e. finding the point
$p_0 \in P$ for which

$$M(p_0) \ge M(p) \tag{5.2}$$

for all $p \in P$. If we were using traditional algorithms for finding the maximum, we
would then be interested in the dynamical system defined by the map

$$T : P \to P \tag{5.3}$$

where $T$ represents some “training” algorithm (such as gradient descent, for example).
The best solution in that case would be a fixed point of the map $T$ (if it
exists), i.e.

$$p_0 = T p_0 \tag{5.4}$$
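
As a minimal sketch of such a dynamical system, the code below iterates a map $T$ until it reaches a fixed point. Everything specific in it is an assumption chosen for illustration: the parameter space is one-dimensional, the model is the toy function $M(p) = -(p-2)^2$ with the data dependence suppressed, and $T$ is a single gradient-ascent step with a fixed learning rate. The helper names are hypothetical.

```python
def make_training_map(grad, learning_rate):
    """Return T: p -> p + learning_rate * grad(p), one gradient-ascent step."""
    def T(p):
        return p + learning_rate * grad(p)
    return T

def find_fixed_point(T, p, tol=1e-12, max_iter=10_000):
    """Iterate p <- T(p) until p stops changing (p = T p within tol)."""
    for _ in range(max_iter):
        p_next = T(p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("no fixed point found within max_iter iterations")

# Hypothetical one-parameter model with its maximum at p = 2: M(p) = -(p - 2)**2.
grad_M = lambda p: -2.0 * (p - 2.0)
T = make_training_map(grad_M, learning_rate=0.1)
p0 = find_fixed_point(T, p=-5.0)
print(round(p0, 6))  # 2.0
```

Note that this procedure follows a single point through the parameter space, in contrast to the population-based approach introduced next.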

In genetic algorithms, however, we do not evolve single individuals. We evolve
entire populations. To model a population of individuals representing the maximization
problem above, we will represent each point $p \in P$ as genes, we will
model sexual reproduction (which implies both mutations and crossover during
reproduction) and finally, we will model microevolution on a set of individuals.

5.3.1  Modeling  of  Genetic  Information 

Individuals in a biological population that are more “fit” than others have “better”
DNA. In the maximization problem, points $p \in P$ that correspond to a higher
value of $M(p)$ are better solutions. We can then use the numerical representation
of a point $p \in P$ as part of the genetic material of an individual in the simulated
environment.

It should be emphasized that although each point $p \in P$ of the parameter
space is represented by a set of operons, the complete genome of an individual is
generally larger than the space $P$. In other words, there are genes that do not
directly participate in the evaluation measure defined by the model $M$. This is
quite different from more simplistic implementations of genetic algorithms where
the space $P$ is the complete genome. The genes that are not part of $P$ are dynamical
variables of the evolution that, in general, affect the convergence properties of
the algorithm. For example, mutation is one such gene. In biological organisms
there are also genes that are not directly related to the survival or reproduction
capabilities of an individual. In other words, not all genes have direct external
manifestation that would affect the specific individual carrying them. Those genes
still affect the dynamics of the population in the long run.
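
A genome that is larger than the parameter space can be sketched as follows. The layout is an assumption made for illustration only: the class `Individual`, the one-byte `mutation_gene`, its decoding into a rate between 0 and 0.1, and the toy model are all hypothetical and are not taken from the implementation described in this work.

```python
from dataclasses import dataclass

@dataclass
class Individual:
    parameter_genes: list[int]   # encode the point p in P; seen by the model M
    mutation_gene: int           # hypothetical control gene; not part of P

    def mutation_rate(self) -> float:
        """Decode the one-byte control gene into a per-bit mutation probability."""
        return self.mutation_gene / 255 * 0.1   # assumed range: 0 to 0.1

def evaluate(individual: Individual, model) -> float:
    """Fitness depends only on the parameter genes, never on the control gene."""
    return model(individual.parameter_genes)

model = lambda genes: -sum((g - 100) ** 2 for g in genes)  # hypothetical M
a = Individual([100, 90], mutation_gene=10)
b = Individual([100, 90], mutation_gene=200)
print(evaluate(a, model) == evaluate(b, model))  # True: same point p, same fitness
print(a.mutation_rate() < b.mutation_rate())      # True: but their offspring vary differently
```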

In genetics, the most elementary unit of information is a base pair of nucleotides.
In computers, the most elementary unit of information is a bit. It
would then make sense to say that a base pair is analogous to a bit. Figures
5.3 and 5.4 show the correspondence between the biological DNA and the DNA
modeled in the computer^16.

The  next  informational  structure  in  genetics  is  the  codon.  It  is  a sequence  of  3 
base  pairs  coding  for  one  amino  acid.  In  computers,  numbers  are  represented  by 
bytes.  We  could  then  identify  a codon  as  a byte  in  the  computer. 

A sequence  of  codons  forms  a complete  gene.  When  this  gene  is  expressed,  it 
corresponds  to  a complete  protein.  In  computers,  a sequence  of  bytes  represents 
a number.  We  thus  identify  a gene  with  a number.  It  should  be  noted  that 
if  we  are  dealing  with  real  numbers,  we  should  try  to  avoid  using  floating  or 
double  precision  representation.  In  most  cases,  the  numbers  are  restricted  in  some 

range  or  can  be  made  so  by  suitable  rescalings.  If  we  were  to  use  the  computer 

^®Note  that  we  are  not  modeling  the  intergenic  regions  in  the  real  DNA.  This  is  one  example 
of  functional  (vs.  structural)  modeling. 


136 


Gene  1 


Codons 


Gene  2 


i: 

T 

J_  _ 

T 

DNA 


I V 


QO 


1 ' 

1 \ 

\ 

Intergenic 

Protein  1 
2 amino  acids 


QOO 

Protein  2 
3 amino  acids 


Base  pairs 


Figure  5.3:  Shows  a biological  DNA  with  two  genes  and  the  corresponding  expres- 
sion of  these  genes  to  proteins.  The  most  elementary  unit  of  information  is  the 
base  pair.  A codon  is  3 base  pairs  coding  for  one  amino  acid.  A sequence  of  codons 
is  a gene  coding  for  a complete  protein. 


Gene  1 B^es  Gene  2 


2 bytes  3 bytes 


DNA 


Figure  5.4:  Shows  a 5 byte  long  DNA  in  genetic  algorithms  with  two  genes  and  the 
corresponding  expression  of  these  genes  to  numbers.  The  most  elementary  unit  of 
information  is  the  bit.  A byte  is  8 bits.  A sequence  of  bytes  is  a gene  coding  for  a 
number. 


f 


137 

representations  for  real  numbers,  we  would  have  a lot  of  redundant  bits.  Put 
differently,  microevolution  would  have  to  discover  the  meaningful  interval  of  values 
for  that  gene.  If  we  know  what  that  range  is,  we  should  map  that  range  to  some 
integer  representation  and  use  the  integer  representation  as  the  gene.  For  example, 
suppose we are interested in numbers x ∈ [0, 1]. If we were to use floating point
representation, the size of the gene would be 4 bytes but the accuracy would be only
23 bits corresponding to the significand field in the single precision floating point
representation [19]. In addition, the genetic algorithms would have to discover that
the remaining 9 bits (corresponding to the sign and the exponent) should be fixed,
i.e. that x is restricted in the [0, 1] range. In contrast, if we were to map the [0, 1]
interval  onto  an  unsigned  integer  of  size  4 bytes,  we  would  have  an  accuracy  of  the 
full  32  bits  and  the  genetic  algorithms  would  not  have  to  waste  time  discovering 
what  the  range  is. 
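
The mapping of a known range onto an integer gene can be sketched in C++ as follows. This is only an illustration under stated assumptions: the function names encodeGene and decodeGene and the explicit bounds lo and hi are hypothetical and are not taken from the original implementation.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: map x in [lo, hi] onto the full range of a
// 32-bit unsigned integer gene, so that all 32 bits carry information.
std::uint32_t encodeGene(double x, double lo, double hi) {
    const double maxCode = 4294967295.0;  // 2^32 - 1
    double t = (x - lo) / (hi - lo);      // rescale to [0, 1]
    if (t < 0.0) t = 0.0;                 // clamp out-of-range input
    if (t > 1.0) t = 1.0;
    return static_cast<std::uint32_t>(std::llround(t * maxCode));
}

// Hypothetical helper: recover the real number represented by the gene.
double decodeGene(std::uint32_t code, double lo, double hi) {
    const double maxCode = 4294967295.0;
    return lo + (hi - lo) * (static_cast<double>(code) / maxCode);
}
```

With this encoding every bit pattern of the gene decodes to a valid value in the range, so no mutation or crossover can produce a number outside [lo, hi].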

The  next  step  of  modeling  is  a set  of  genes.  A sequence  of  related  genes  is  called 
an  operon.  We  can  model  an  operon  by  simply  combining  the  representation  of 
several  numbers  in  a contiguous  sequence  of  bits.  The  set  of  all  operons  forms  a 
complete  DNA  molecule,  which  in  eukaryotes  is  encapsulated  as  a chromosome. 
We  model  the  chromosome  by  putting  the  representation  of  all  operons  one  after 
the  other.  We  will  discuss  the  role  of  operons  in  more  detail  when  we  investigate 
the  effects  of  crossover  in  genetic  algorithms. 
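
A contiguous layout of several genes can be sketched as below. The sketch assumes, purely for illustration, that each gene is stored as a big-endian unsigned integer of at most 8 bytes; the names Chromosome, packGenes and readGene are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative layout: a chromosome is a flat byte string in which gene i
// occupies widths[i] consecutive bytes (most significant byte first).
using Chromosome = std::vector<std::uint8_t>;

// Place the genes one after the other in a single byte string.
Chromosome packGenes(const std::vector<std::uint64_t>& values,
                     const std::vector<std::size_t>& widths) {
    Chromosome dna;
    for (std::size_t g = 0; g < values.size(); ++g) {
        for (std::size_t b = widths[g]; b-- > 0;) {
            dna.push_back(static_cast<std::uint8_t>(values[g] >> (8 * b)));
        }
    }
    return dna;
}

// Express gene number `index` as the integer it codes for.
std::uint64_t readGene(const Chromosome& dna,
                       const std::vector<std::size_t>& widths,
                       std::size_t index) {
    std::size_t offset = 0;
    for (std::size_t g = 0; g < index; ++g) offset += widths[g];
    std::uint64_t value = 0;
    for (std::size_t b = 0; b < widths[index]; ++b) {
        value = (value << 8) | dna[offset + b];
    }
    return value;
}
```

Packing a 2 byte gene followed by a 3 byte gene in this way gives a 5 byte string of the kind shown in Fig. 5.4.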

The final structure is the set of all chromosomes. This is the genome of an individual.
In computers, the set of all computer chromosomes completely represents a
point p ∈ P as well as any other genes that are not part of the parameter space P.
Thus solving a maximization problem implies finding an individual whose genome
corresponds to the point p0.

5.3.2  Modeling  of  Genetic  Variation 

As  discussed  earlier,  the  two  most  important  sources  for  genetic  variation  are 
mutation  and  crossover.  Mutation  randomly  changes  a base  pair,  while  crossover  is 
the  process  by  which  two  DNA  molecules  form  a new  DNA  molecule  that  consists 
of  alternating  regions  of  DNA  originating  from  the  first  or  second  DNA  as  shown 
in  Fig.  5.2.  We  model  crossover  in  genetic  algorithms  in  a similar  way.  For  each 
chromosome  in  the  genome  of  the  species,  we  take  one  chromosome  from  the  mother 
and  another  from  the  father.  These  chromosomes  are  a string  of  bits.  We  then 
decide  whether  to  perform  crossover  at  all  by  using  the  value  of  the  crossover  rate, 
Rc.  This  is  the  probability  for  crossover  per  chromosome  (not  the  entire  genome). 
If  there  is  no  crossover,  the  child  DNA  is  identical  to  that  of  one  of  the  parents. 
In  case  there  is  crossover,  we  randomly  select  a place  on  the  DNA  where  crossover 
will  occur.  The  child  DNA  then  either  consists  of  the  first  portion  coming  from 
the  mother  and  the  second  from  the  father,  or  vice  versa  as  shown  in  Fig.  5.5.  As 
a consequence,  some  genes  come  from  the  mother  and  some  come  from  the  father. 
In  addition,  the  crossover  point  is  most  likely  not  on  the  boundary  between  two 
genes.  As  a consequence,  the  gene  that  contains  the  crossover  point  will  change. 
The  high  bits  come  from  one  of  the  parents  and  the  low  bits  come  from  the  other 
parent.  If  the  high  bits  of  both  parents  are  the  same  for  this  gene  and  if  this  gene 
represents  a real  number,  then  the  variation  in  the  low  bits  has  the  effect  of  a local 
algorithm. 
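
The single-point crossover described above can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the chromosome is stored as a byte vector, bit 0 is the most significant bit of the first byte, both parents have the same length, and the parent that supplies the first portion is chosen with probability 1/2. The names crossAt and crossover are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using Chromosome = std::vector<std::uint8_t>;

// Child takes bits [0, point) from `head` and bits [point, end) from `tail`.
Chromosome crossAt(const Chromosome& head, const Chromosome& tail,
                   std::size_t point) {
    assert(head.size() == tail.size() && point <= 8 * head.size());
    Chromosome child = tail;
    const std::size_t fullBytes = point / 8;
    for (std::size_t i = 0; i < fullBytes; ++i) child[i] = head[i];
    const std::size_t rem = point % 8;
    if (rem != 0) {
        // The crossover point falls inside a byte: high bits from `head`,
        // low bits from `tail`.
        const std::uint8_t mask = static_cast<std::uint8_t>(0xFF << (8 - rem));
        child[fullBytes] = static_cast<std::uint8_t>(
            (head[fullBytes] & mask) | (tail[fullBytes] & ~mask));
    }
    return child;
}

// Perform crossover with probability `rate` per chromosome; otherwise the
// child is a copy of one of the parents.
template <class Rng>
Chromosome crossover(const Chromosome& mother, const Chromosome& father,
                     double rate, Rng& rng) {
    assert(mother.size() == father.size() && !mother.empty());
    const Chromosome* first = &mother;
    const Chromosome* second = &father;
    std::bernoulli_distribution coin(0.5);
    if (coin(rng)) std::swap(first, second);  // "or vice versa"
    std::bernoulli_distribution doCross(rate);
    if (!doCross(rng)) return *first;
    std::uniform_int_distribution<std::size_t> pick(1, 8 * mother.size() - 1);
    return crossAt(*first, *second, pick(rng));
}
```

Because the crossover point is drawn at the bit level, it usually falls inside a gene, which is what produces the mixing of high and low bits discussed above.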


[Figure 5.5 diagram: Mother DNA and Father DNA strings, and the resulting Child DNA.]

Figure  5.5:  Shows  the  crossover  mechanism  in  genetic  algorithms.  A point  for 
crossover  is  selected  randomly.  The  child’s  DNA  becomes  one  of  the  two  possible 
combinations  of  mother  and  father  DNA. 


[Figure 5.6 diagram: a DNA string of bits holding Gene 1 (Number 1, 2 bytes) and Gene 2
(Number 2, 3 bytes), with two mutated bits marked.]

Figure 5.6: Mutation on a 5 byte long DNA. Two of the bits are changed
randomly. The number of bits changed depends on the mutation rate Rm. The
positions of the bits are random.

After crossover, the child's DNA is mutated as shown in Fig. 5.6. The
mutation rate, Rm, determines the probability for mutation per bit*. Thus there
can be more than one mutation for a chromosome. Each mutation consists of a
random change of the value of one of the bits on the chromosome. Each single
mutation affects only one gene. For high values of Rm, mutation acts as a random
algorithm.
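
A per-bit mutation of this kind can be sketched as follows. The sketch assumes that a mutation flips the selected bit and that the chromosome is stored as a byte vector; the name mutate and the returned count are hypothetical conveniences.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using Chromosome = std::vector<std::uint8_t>;

// Flip each bit independently with probability `ratePerBit`.
// Returns the number of bits that were flipped.
template <class Rng>
std::size_t mutate(Chromosome& dna, double ratePerBit, Rng& rng) {
    std::bernoulli_distribution flip(ratePerBit);
    std::size_t flipped = 0;
    for (auto& byte : dna) {
        for (int bit = 0; bit < 8; ++bit) {
            if (flip(rng)) {
                byte ^= static_cast<std::uint8_t>(1u << bit);
                ++flipped;
            }
        }
    }
    return flipped;
}
```

Since every bit is tested separately, the expected number of mutations grows with the length of the chromosome while the rate itself keeps the same meaning.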

* This definition of mutation rate has the advantage that it does not depend on the chromosome
sizes and, therefore, has the same meaning for both small and large DNA.


5.3.3  Modeling  of  Microevolution 

The easiest way to model microevolution in the computer is to have an array
of strings where each string represents the complete genome of an individual.
Evolution  can  be  achieved  by  an  algorithm  that  selects  pairs  of  strings,  performs 
crossover  and  mutation  and  decides  which  individuals  should  die  in  the  current 
generation.  This  decision  is  based  on  a performance  measure  that  depends  on  the 
genome  of  each  individual.  This  approach  is  quite  common,  but  it  is  very  inflexible 
and,  generally,  has  poor  convergence  properties  as  will  be  discussed  in  the  next 
section. 

We  have  taken  a different  approach.  We  model  the  environment.  We  model 
various  individuals  that  can  “live”  in  it.  Individuals  have  behavior  that  does 
not  always  depend  on  their  genome.  There  is  an  implicit  population  control. 
Each  individual  has  a performance  measure  which  is  unknown  to  the  environment. 
The  genome  of  an  individual  can  have  more  than  one  chromosome.  Finally,  each 
individual  can  be  trained  by  some  embedded  training  algorithm,  such  as  gradient 
descent,  for  example. 

This  approach  is  much  more  complicated  to  implement,  but  it  is  both  more 
flexible  and  orders  of  magnitude  more  efficient.  The  guiding  principle  in  the  model 
has  been  to  be  as  close  as  possible  to  the  real  biological  populations  and  to  utilize 
the  billions  of  years  of  experience  that  nature  has  had.  Due  to  the  complexity 
of the model, an object-oriented implementation would be the best choice. Since
efficiency is also very important, C++ has been chosen since it is the only language
providing both a high level of abstraction and system programming level of
efficiency [20].


The first important aspect of the model is that the environment and the individuals
are almost completely separated from each other. For example, it is not
the  environment  that  decides  whether  an  individual  will  reproduce  - the  individual 
itself  makes  that  decision.  The  only  relationship  between  the  environment  and  the 
individuals  is  that  of  compatibility. 

The  physical  properties  of  the  environment  are  represented  by  the  data  sample, 
D,  on  which  each  individual  operates  to  compute  its  performance.  In  biological 
populations,  individuals  that  can  adapt  more  to  the  physical  environment  are 
more fit. In the model, fitter individuals have a higher performance value given by
M(D, p), where p ∈ P is a subset of the complete genome of an individual. Since
the  data  sample  D represents  the  physical  properties  of  the  environment  and  since 
it  is  used  in  the  computation  of  the  performance  for  each  individual,  there  must 
be  compatibility  between  the  environment  and  the  individual.  For  example,  if  our 
data  sample  is  a table  of  5 variables,  individuals  that  expect  10  variables  would 
not  be  able  to  live  in  it  - exactly  the  same  way  a fish  cannot  live  on  land. 

This relationship is expressed by the C++ class diagram [20, 21] shown in Fig.
5.7. The base class EnvironmentalObject serves two major purposes: First, it
ensures  compatibility  between  an  individual  and  the  environment.  Second,  it  serves 
as  a base  class  for  polymorphism  with  respect  to  different  species  of  individuals, 
as  will  be  shown  in  the  last  two  sections. 
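
One possible reading of this relationship is sketched below. Only the class names EnvironmentalObject, Environment and Individual come from the text; the DataSample structure, the member names and the compatibility test are assumptions made for illustration and need not match the original code.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical data sample D: a table with a fixed number of variables.
struct DataSample {
    std::size_t numVariables;
    std::vector<std::vector<double>> rows;
};

// Base class: ties every object to the data sample it operates on.
class EnvironmentalObject {
public:
    explicit EnvironmentalObject(std::shared_ptr<const DataSample> data)
        : data_(std::move(data)) {}
    virtual ~EnvironmentalObject() = default;
    const DataSample& data() const { return *data_; }
    bool sharesDataWith(const EnvironmentalObject& other) const {
        return data_ == other.data_;
    }
private:
    std::shared_ptr<const DataSample> data_;
};

class Environment : public EnvironmentalObject {
public:
    explicit Environment(std::shared_ptr<const DataSample> data)
        : EnvironmentalObject(std::move(data)) {}
};

class Individual : public EnvironmentalObject {
public:
    Individual(std::shared_ptr<const DataSample> data,
               std::size_t expectedVariables)
        : EnvironmentalObject(std::move(data)),
          expectedVariables_(expectedVariables) {}
    // An individual can live in an environment only if both use the same D
    // and D has the number of variables the individual expects.
    bool compatibleWith(const Environment& env) const {
        return sharesDataWith(env) &&
               data().numVariables == expectedVariables_;
    }
private:
    std::size_t expectedVariables_;
};
```

In this sketch an individual that expects 10 variables is rejected by an environment whose table has 5, which mirrors the fish-on-land example.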

Figure 5.7: C++ class diagram for the relationship between environment and any
individual. The base class, EnvironmentalObject, specifies the data sample D.
The derived classes Environment and Individual both use the same D.

The second aspect of the model is simulation of parallelism with respect to
individuals. The environment does not make decisions on behalf of the individuals.
It is the individuals that play the active role, exactly as it is in real biological
populations. This is a very important factor for at least the following reasons:

• Since the environment is passive, it can be implemented completely generically
without any regard to what individuals will live in it (linear cuts, neural
networks, or anything else). This increases the flexibility and the reliability of the
application.

• There  can  be  no  meaningful  global  population  control  policy  that  is  good 
for  all  species  and  at  all  stages  of  evolution.  Since  it  is  the  individuals  that 
decide  when  to  reproduce,  population  control  is  implicit  and  it  adapts  itself 
to  the  species  and  the  stage  of  evolution.  This  has  the  effect  of  more  efficient 
utilization  of  CPU  resources.  We  will  give  more  details  in  the  next  section. 

• The  separation  of  environment  and  individuals  makes  it  possible  to  easily 
parallelize  the  computations  for  very  CPU  intensive  problems. 

• In conjunction with the polymorphism on individuals, the separation of environment
allows the simultaneous evolution of different species. One could,
for example, investigate the same data sample D with two different models,
M1 and M2, as long as they use the same performance measure. The
two models would be two different species competing with each other. The
species that wins the battle not only provides us with a solution, but also
tells us which model is better.

• The independence of individuals suggests a straightforward extension to
include  migration.  This  would  not  only  parallelize  the  computations,  but 
also  would  have  the  additional  advantage  of  increasing  the  genetic  diversity 
of  the  species  which  would  make  the  algorithm  even  more  robust. 

• The  performance  of  each  individual  is  not  a global  measure  of  fitness.  The 
individuals  themselves  are  evaluating  each  other  during  sexual  selection.  This 
has  the  interesting  consequence  that  one  does  not  necessarily  need  to  have 
a real  valued  performance  measure.  In  principle,  the  evaluation  could  be 
with  respect  to  more  than  one  factor.  Clearly,  a traditional  maximization 
algorithm  cannot  handle  this  situation  even  in  principle,  since  only  numbers 
on  the  real  line  can  be  ordered. 

The environment is an object-oriented implementation of a mini non-preemptive
operating system [22] for individuals that derive from the class
EnvironmentalObject*. During one generation, each individual receives execution from the
environment. The individual then makes a decision whether to reproduce. This
decision depends on the relative population size. If the individual decides to reproduce,
it selects a mate and produces an offspring. In addition, the individual
can die of old age (which contributes to the implicit population control). In the
next section we describe in more detail how these actions affect the convergence
properties of the algorithm.

* A non-preemptive operating system cannot stop the execution of a process once it is running.
It is the responsibility of the process (in our case the object) to give back execution to the
operating system. A preemptive operating system, on the other hand, can stop a process at
any time. We could have implemented a preemptive policy, but it would have been much more
complicated and less portable without any significant benefits.
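
The cooperative hand-off of execution can be sketched as follows. This is a minimal illustration, not the original design: the member functions live, alive, add and runYear, the newborn list and the toy species are all hypothetical, and the classes are simplified stand-ins for those named in the text.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

class Environment;

// Anything the environment can schedule.
class EnvironmentalObject {
public:
    virtual ~EnvironmentalObject() = default;
    // Runs one simulation year, then returns control voluntarily.
    virtual void live(Environment& env) = 0;
    virtual bool alive() const = 0;
};

class Environment {
public:
    explicit Environment(std::size_t maxSize) : maxSize_(maxSize) {}

    // Newcomers wait in a separate list so that the population is not
    // modified while it is being iterated over.
    void add(std::unique_ptr<EnvironmentalObject> obj) {
        if (size() < maxSize_) newborn_.push_back(std::move(obj));
    }
    std::size_t size() const { return population_.size() + newborn_.size(); }
    double relativeSize() const {
        return static_cast<double>(size()) / static_cast<double>(maxSize_);
    }

    void runYear() {
        admitNewborn();
        for (auto& obj : population_) obj->live(*this);  // hand over execution
        population_.erase(
            std::remove_if(population_.begin(), population_.end(),
                           [](const std::unique_ptr<EnvironmentalObject>& o) {
                               return !o->alive();
                           }),
            population_.end());
        admitNewborn();
    }

private:
    void admitNewborn() {
        for (auto& obj : newborn_) population_.push_back(std::move(obj));
        newborn_.clear();
    }
    std::size_t maxSize_;
    std::vector<std::unique_ptr<EnvironmentalObject>> population_;
    std::vector<std::unique_ptr<EnvironmentalObject>> newborn_;
};

// Toy species: reproduces while the environment is less than half full
// and dies of old age after maxAge years.
class ToyIndividual : public EnvironmentalObject {
public:
    explicit ToyIndividual(int maxAge) : maxAge_(maxAge) {}
    void live(Environment& env) override {
        ++age_;
        if (env.relativeSize() < 0.5) {
            env.add(std::make_unique<ToyIndividual>(maxAge_));
        }
    }
    bool alive() const override { return age_ < maxAge_; }
private:
    int age_ = 0;
    int maxAge_;
};
```

The environment never inspects what an individual does during its turn; it only removes those that report themselves dead and admits the offspring they created.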

5.4  Convergence  Factors  in  Genetic  Algorithms 

Genetic algorithms are useful for a variety of reasons. One of their most important
features is the fact that they are a universal maximization algorithm with
both  global  and  local  properties.  The  latter  is  a very  important  advantage  since 
there  are  a number  of  problems  for  which  the  properties  of  the  parameter  space 
P are  not  known  or  impractical  to  investigate.  In  particular,  one  generally  does 
not know how many local maxima there are and how “wide” the peaks of the map
M(D) : P → R are. As a consequence, the usage of non-universal algorithms is
likely  to  lead  to  inferior  solutions  corresponding  to  either  local  maxima  or  to  local 
flat  regions.  The  usage  of  a strictly  local  algorithm  (such  as  gradient  descent,  for 
example)  on  a multi-dimensional  space  P is  in  many  cases  inadequate.  Attempts 
to “globalize” local algorithms are often inadequate as well. For example, the gradient
descent algorithm can be combined with Gaussian noise (which is known
as simulated annealing) in order to make it more “global” [11]. Unfortunately, the
Gaussian noise has a scale associated with it and unless we know the properties
of the hypersurface M(D, P), the choice of the scale for the Gaussian noise would
be  completely  arbitrary.  Even  though  there  would  be  some  improvement  in  the 
algorithm,  it  is  generally  not  likely  that  the  algorithm  will  converge  to  the  best 
solution  even  in  principle. 
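
The role of the noise scale can be made concrete with a one-dimensional sketch of a gradient step with added Gaussian noise. It follows only the description in the text (a gradient step plus noise) and omits any cooling schedule; the function name noisyAscent and its parameters step and sigma are hypothetical.

```cpp
#include <random>

// One-dimensional gradient ascent with additive Gaussian noise.
// `sigma` is the noise scale that has to be chosen by hand.
template <class Gradient, class Rng>
double noisyAscent(Gradient gradient, double x, double step, double sigma,
                   int iterations, Rng& rng) {
    std::normal_distribution<double> unitNoise(0.0, 1.0);
    for (int i = 0; i < iterations; ++i) {
        x += step * gradient(x) + sigma * unitNoise(rng);
    }
    return x;
}
```

With sigma equal to zero this reduces to a purely local algorithm. With a nonzero sigma, jumps are only likely between peaks separated by a distance comparable to sigma, which is why the choice is arbitrary when the shape of M(D, P) is unknown.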


Genetic algorithms, on the other hand, do not have to assume any particular
scales associated with the hypersurface M(D, P). At the initial stages of evolution,
all  scales  are  investigated  at  the  same  time.  The  algorithm  acts  as  a random 
algorithm.  At  later  stages  of  evolution,  however,  certain  scales  are  represented 
more  in  the  population  than  others.  This  has  the  effect  of  local  convergence  to  a 
set  of  “good”  solutions. 

Mutation,  as  a source  of  genetic  change,  has  the  effect  of  investigating  all 
scales  at  the  same  time.  As  long  as  there  is  some  mutation,  there  is  always  the 
possibility  to  find  a better  solution.  Crossover,  on  the  other  hand,  is  both  “global” 
and  “local”  in  nature.  If  it  did  not  exist,  genetic  algorithms  would  be  almost 
useless  since  they  would  just  be  a random  algorithm.  From  the  point  of  view  of 
scales on the hypersurface M(D, P), the genetic pool of a population defines a
distribution  of  scales  around  a finite  set  of  solutions. 

The  biggest  challenge  in  implementing  genetic  algorithms  is  to  have  the  “right” 
balance  between  globality  and  locality.  If  mutation  dominates,  it  would  be  less 
likely  for  a population  to  cluster  around  a local  maximum,  but  the  algorithm 
would  converge  very  slowly.  On  the  other  hand,  if  mutation  is  not  very  prominent, 
the  algorithm  can  converge  more  quickly,  but  there  would  be  a higher  probability 
for the population to cluster around a single local maximum, i.e. the algorithm
would  be  “too  local”  in  nature. 

In  addition  to  mutation  and  crossover,  other  factors  such  as  population  control, 
natural  selection  policy,  sexual  selection  and  embedded  algorithms  also  influence 
the  convergence  properties  of  genetic  algorithms.  An  implementation  of  genetic 
algorithms that is too simplistic is likely to have very poor convergence. That
would negate the desirable features of genetic algorithms since it would become
impractical  to  use  them.  Therefore,  it  is  imperative  to  understand  what  the  effect 
of  each  factor  is  on  the  overall  convergence,  i.e.  the  population  dynamics.  It  would 
be  very  difficult,  if  at  all  possible,  to  investigate  the  dynamics  of  a population  in 
a rigorous  mathematical  fashion.  Nevertheless,  the  study  of  natural  populations 
can  be  a powerful  guiding  principle,  since  nature  has  had  the  time  to  try  many 
different possible ways for evolution. This suggests that an implementation
of genetic algorithms should closely model natural populations from a functional point
of view.

Before  we  turn  to  a detailed  analysis  of  how  the  various  factors  influence 
the  convergence  of  genetic  algorithms,  we  must  clarify  what  “good  convergence” 
means.  We  are  not  interested  in  fast  convergence  to  a local  maximum  or  flat  space, 
especially  if  this  implies  that  the  population  becomes  less  genetically  diverse.  In 
other  words,  if  the  population  clusters  around  a single  solution,  then  it  would  be 
very unlikely that subsequent generations will find a better solution. Such an algorithm
would have the same problems as local algorithms. The important measure
is  that  (eventually)  the  genetic  algorithms  find  the  best  solution  or  solutions,  i.e. 
that at least one individual in a perhaps distant generation carries genes corresponding
to the best point p0 ∈ P. This implies that the population should almost
never become homogeneous, especially if the parameter space P has many dimensions*.
In biological terminology, we must maintain genetic diversity at all times.

* If the model M is not very complicated and the parameter space P does not have too many
dimensions, then it might be justifiable to let a population evolve to the point where most or all
of the individuals are the same. In general, though, such an approach does not lead to good results.

That implies that mutations must be present and significant at all times during
evolution, since crossover alone would tend to decrease the genetic diversity. It
is  worth  noting  that  some  simpler  implementations  of  genetic  algorithms  use  the 
lack of diversity as a measure to stop evolution. In general, we should avoid losing
diversity and not have any specific criterion for when evolution should stop. We can
never  be  certain  that  we  found  the  best  solution  and,  therefore,  we  should  continue 
evolution  for  as  long  as  we  can  afford  it. 

5.4.1  Population  Size 

There  are  two  types  of  genetic  algorithms  with  respect  to  population  size:  The 
first  uses  fixed  size  and  the  second  uses  variable  size.  In  the  genetic  algorithms 
described  here,  we  use  variable  population  size  and  a fixed  maximum  size  in  order 
not  to  exceed  memory  limitations. 

In  general,  the  larger  the  parameter  space  P is,  the  larger  the  maximum  size  of 
the  population  should  be.  If  the  population  is  too  small  relative  to  the  complexity 
of  the  problem,  it  would  not  be  able  to  support  sufficient  genetic  diversity  and 
as  a consequence  the  positive  effects  of  crossover  would  be  diminished.  The  same 
problem  exists  in  biological  populations.  Small  populations  do  not  have  enough 
genetic  diversity  and  are  even  in  danger  of  becoming  extinct.  In  the  context  of 
solving  maximization  problems,  a small  population  does  not  have  the  capacity  to 
“remember”  good  solutions  and  recombine  them  through  crossover  to  get  even 
better  solutions.  As  a result,  the  population  continuously  re-discovers  which  genes 
are  good,  but  never  has  the  time  to  create  better  genes.  The  performance  is  poor 
and  quite  unpredictable. 


The other extreme, too large a population, does not have such a negative impact
on evolution. Nevertheless, the population is quite inert and changes can be too
slow.  To  avoid  having  to  guess  what  the  maximum  population  size  should  be, 
natural  selection  is  used  to  speed  up  the  process  of  major  re-organization  of  the 
genetic  pool  of  a population.  This  is  a very  common  occurrence  in  later  stages  of 
evolution  when  the  population  has  discovered  one  or  more  good  solutions  and  is 
working  on  improving  on  them.  When  a brand  new  solution,  that  is  much  better 
than  all  others,  enters  the  population,  it  would  be  wasteful  to  evolve  through  many 
generations  before  that  new  solution  is  solidified.  Instead,  we  use  natural  selection 
relative  to  the  average  performance  of  the  population.  A superior  solution  makes 
almost  all  other  individuals  complete  losers.  As  a consequence,  the  population  size 
is  reduced  drastically  by  killing  many  of  the  old  individuals  and  allowing  the  new 
solution  to  propagate  quickly. 

This  process  of  mass  extinction  has  the  effect  of  a very  efficient  utilization  of  the 
CPU  without  having  to  use  some  arbitrary  parameters  in  the  algorithm  that  are 
likely  to  fail  in  different  circumstances.  While  there  is  no  rigorous  way  to  determine 
the  maximum  size,  due  to  the  adaptive  nature  of  the  population  size  one  only 
needs to worry about the population not being too small. A maximum population of about
1000 individuals works well on very simple problems (dim(P) = 2),
as well as on training of neural networks (dim(P) > 100).

5.4.2  Natural  Selection  (Survival  of  the  Fittest) 

Evolution  without  natural  selection  is  impossible.  After  all,  evolution  of  a 
population is its ability to adapt to the environment. Individuals that are less fit
have a smaller chance to survive and, therefore, to reproduce. As a result, less fit
individuals  have  a smaller  probability  to  have  any  impact  on  future  generations. 
This  probability,  however,  is  not  zero.  In  natural  populations,  individuals  are  not 
necessarily  killed  just  because  a given  trait  makes  them  less  fit  than  others.  In 
fact,  it  can  happen  that  two  separately  “bad”  traits  can  produce  a good  trait  if 
combined  together  via  crossover,  for  example. 

Artificial  populations,  not  surprisingly,  also  benefit  from  a more  passive  role 
of  the  environment  when  it  comes  to  killing  an  individual  because  he  or  she  is 
less  fit  than  others.  In  the  genetic  algorithms  described  here,  natural  selection 
kills  individuals  on  a probabilistic  basis.  Suppose  there  are  N individuals  in  a 
population  during  the  current  generation.  Natural  selection  randomly  kills  about 
50%  of  the  individuals  whose  performance  is  less  than  average  as  shown  in  Fig. 
5.8.  That  implies  that  even  individuals  with  a very  bad  performance  have  a 
finite probability of surviving at least a few simulation years*. The possibility
for  survival  of  “unfit”  individuals  is  a critical  factor  for  the  robust  convergence  of 
genetic  algorithms  to  better  populations. 
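
The probabilistic selection rule can be sketched as follows. The sketch assumes that each individual below the population average is removed independently with a fixed probability (0.5 in the text); the function name naturalSelection and the representation of the population as a list of performance values are hypothetical simplifications.

```cpp
#include <numeric>
#include <random>
#include <vector>

// Remove each below-average individual with probability `killProbability`.
// Individuals at or above the average always survive this step.
template <class Rng>
std::vector<double> naturalSelection(const std::vector<double>& performance,
                                     double killProbability, Rng& rng) {
    if (performance.empty()) return {};
    const double average =
        std::accumulate(performance.begin(), performance.end(), 0.0) /
        static_cast<double>(performance.size());
    std::bernoulli_distribution dies(killProbability);
    std::vector<double> survivors;
    for (double m : performance) {
        if (m < average && dies(rng)) continue;  // killed by selection
        survivors.push_back(m);
    }
    return survivors;
}
```

Because the threshold is the current average, a single outstanding newcomer pushes almost everyone else below it, so a much larger share of the population is exposed to removal in the following years.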

* A simulation year is one cycle of evolution. It is less than one generation, since the average
life-span of individuals is at least several years and since they do not necessarily reproduce every
year.

[Figure 5.8 plots: three performance distributions, each with its average marked.]

Figure 5.8: Different performance distributions for individuals of a population.
The dashed line in each case is the average performance for that distribution.
Here N is the number of individuals per unit performance and Mmax is the performance
of the best individual. In the first case, most of the individuals are below
the average and as a result many individuals die due to natural selection. In the
second case, about 50% are below average and only about 25% of the population
dies due to natural selection. In the last case, most of the individuals have above
average performance and as a result very few are affected by natural selection.

On the surface, it may seem wasteful to keep bad performers around. In fact, it
is exactly the opposite. If natural selection unconditionally exterminates the worst
individuals, the convergence would be orders of magnitude slower. To illustrate
the reason for this effect, let us consider the simple one-dimensional maximization
problem shown in Fig. 5.9. The population consists of individuals who have
already discovered the two local maxima (represented with circles) as well as one
individual whose gene is closest to the global maximum (represented with the
square).  The  most  likely  cause  for  this  new  genetic  information  being  introduced 
in  the  population  is  mutation.  Mutation,  however,  is  a completely  random  event 
and  it  is  not  likely  at  all  that  a mutation  alone  will  discover  the  global  maximum 
or  even  produce  a solution  better  than  the  current  average  for  the  population. 
In  other  words,  mutation  can  put  an  individual  in  the  vicinity  of  a much  better 
solution  with  a reasonable  probability,  but  this  solution  has  to  be  refined  by  other 
mechanisms,  such  as  crossover,  for  example. 

Figure 5.9: A population of individuals. Each individual has a gene for
the one-dimensional parameter space P. The curve represents the model function
M(P). The circles are individuals clustering around the two local maxima. The
square is an individual that has the worst performance, yet its offspring has the
greatest chance for discovering the global maximum.

If we were to unconditionally kill the worst individuals, we would lose valuable
genetic information discovered by mutation. We would have to wait until mutation
discovers not only the vicinity of the best solution, but actually a solution much
better than the average. This is not likely to happen and such an algorithm would
have very poor convergence, especially for large parameter spaces.

On  the  other  hand,  if  we  do  not  kill  the  worst  performers  unconditionally, 
the  new  individual  will  have  a chance  to  reproduce.  During  the  reproduction, 
its  child  can  have  a better  performance  due  to  crossover  or  some  local  embedded 
algorithms  which  we  will  discuss  later.  As  a consequence,  a more  passive  role 
of  the  environment  when  it  comes  to  natural  selection  improves  the  convergence 
immensely*.


* Some simpler genetic algorithms use a fixed population size, double the population in each
cycle and then discard the worst half. The end result is that it takes a long time to discover a
new solution after which the population loses its genetic diversity and practically stops evolving.
One cannot expect to achieve much with such unjustified simplifications.

5.4.3 Reproduction

Reproduction is another critical factor in the successful construction of genetic
algorithms. All genetic algorithms that use crossover implicitly assume the existence
of sex. Yet the advantages that an explicit modeling of sex offers are often
ignored.  As  with  all  other  evolutionary  factors,  nature  provides  us  with  successful 
solutions.  In  particular,  the  predominance  of  sexual  reproduction  in  living  organ- 
isms implies  that  there  is  a significant  benefit  to  reproduce  in  that  fashion.  Even 
some  bacteria  can  reproduce  sexually. 

The  major  advantage,  as  it  was  mentioned  earlier,  is  the  speed  with  which 
a population  can  change  its  genetic  pool  due  to  crossover  of  DNA.  In  addition, 
more  complex  organisms  exhibit  complex  sexual  behavior  between  members  of  a 
population.  This  has  the  effect  of  further  speeding  up  the  process  of  genetic  change. 
From  our  point  of  view,  modeling  of  these  phenomena  improves  the  convergence 
of  genetic  algorithms  even  in  the  constant  presence  of  mutation. 


Sexual  Selection 

In  the  version  of  genetic  algorithms  described  here,  we  model  the  individuals 
as  hermaphrodites.  Each  individual  can  be  either  male  or  female  during  a single 
act  of  reproduction.  Hence  the  individuals  are  most  closely  related  to  plants,  since 
this kind of sexual reproduction is mostly used by plants*.

* It is interesting to note that there seems to be a correspondence between the environment
in the genetic algorithms described here and the environment of plants from the point of view
of the reproduction mechanism. Many plants are hermaphrodites because they cannot move in
the physical environment and, therefore, if external agents for the transfer of genetic information
are not present (such as wind, bees, etc.), they would have to mate with themselves. In the
genetic algorithms described here, the individuals are also not active in the environment. For
example, physical motion is not modeled. Yet they do not have an explicit sex simply because
their behavior does not require it and not because we made the decision to model plants. There
seems to be consistency between the natural world and the artificial model.

[Figure 5.10 plot: birth probability (0 to 1) against relative population size (0 to 1).]

Figure 5.10: The birth rate as a function of the relative population size. The
more individuals there are, the less likely it is to produce offspring. This is one of
the population control factors.

During a simulation year, an individual makes a decision whether to reproduce
or not. This decision is probabilistic in nature and it depends on the relative
population size during that year as shown in Fig. 5.10. The relative population
size is the current size of the population normalized with respect to the maximum
population size. The birth rate, Rb, decreases as the population grows. This
contributes to an improved performance in two different ways:

1. When the population is large, it is likely that there is no substantially superior
genetic  material  present  in  the  genetic  pool  except  for  mutations  that  might 
eventually  lead  to  a better  solution.  As  a result,  it  would  be  wasteful  to 
produce  too  many  children,  since  they  are  not  likely  to  improve  the  currently 
best  solution  too  much,  if  at  all,  while  at  the  same  time  they  would  decrease 
the  probability  for  new  mutations  to  survive  since  the  new  children  would 
also  compete  for  resources  with  inferior  but  promising  individuals.  As  a net 
result,  when  the  population  is  large,  it  is  relatively  stable  and  most  of  the 
CPU  time  is  spent  on  the  potential  improvement  of  new  mutations. 




2.  If  the  individuals  use  embedded  algorithms  (which  are  described  later),  they 
have  the  opportunity  to  improve  (most  likely  in  a local  fashion)  their  existing 
solutions  and  not  spend  too  many  resources  for  reproduction. 

In  both  cases,  a large  population  implies  a relatively  stable  population  which  does 
not  contain  solutions  that  differ  by  much.  In  that  case,  crossover  is  not  the  leading 
factor  for  evolution  - mutation  and  (possibly)  embedded  algorithms  are. 
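
The reproduction decision described above can be sketched as follows. This is only an illustration under stated assumptions: the text says that Rb decreases as the relative population size grows, but the exact curve of Fig. 5.10 is not given, so a linear form is assumed here. The names `birth_probability` and `decides_to_reproduce` are hypothetical.

```python
import random


def birth_probability(population_size, max_population_size):
    """Assumed linear birth curve: 1 for an empty population, 0 at the maximum size."""
    relative_size = population_size / max_population_size
    # Clamp to [0, 1] in case the population temporarily overshoots the maximum.
    return min(1.0, max(0.0, 1.0 - relative_size))


def decides_to_reproduce(population_size, max_population_size, rng=random):
    """Probabilistic decision taken once per simulation year by each individual."""
    return rng.random() < birth_probability(population_size, max_population_size)
```

Any other monotonically decreasing curve could be substituted for the linear one without changing the structure of the decision.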

If,  based  on  the  birth  rate  Rb,  an  individual  decides  to  reproduce,  then  we 
treat  it  as  a female  with  respect  to  sexual  behavior.  The  female  examines  several 
individuals  in  the  population,  which  we  treat  as  males,  and  chooses  the  best  one 
among  them.  It  is  important  to  emphasize  that  the  female  does  not  choose  the 
star  performer  of  the  population.  Such  behavior  would  be  detrimental  to  genetic 
diversity  and,  in  the  long  run,  would  lead  to  most  individuals  being  the  same  and 
representing  some  local  maximum. 

The  act  of  choosing  the  best  male  among  a small  group  of  individuals  has  two 
implications  for  the  convergence  properties  of  the  algorithm: 

1.  Due  to  the  fact  that  there  is  an  additional  selection  mechanism,  namely 
sexual  selection,  the  rate  of  genetic  change  is  increased,  since  good  genes  have 
an  even  higher  probability  of  being  disseminated  through  future  generations. 
This  is  important  whenever  a significantly  better  solution  is  discovered  which 
makes  the  existing  genetic  material  obsolete. 

2.  Due  to  the  fact  that  only  a small  subset  of  males  participate  in  this  selection, 
the  population  is  not  in  danger  of  losing  its  genetic  diversity. 
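
The mate selection just described resembles what is often called tournament selection. A minimal sketch is given below; the function name, the `performance` callable, and the default `group_size` are assumptions, since the text only says that the female examines "several" individuals.

```python
import random


def choose_mate(female, population, performance, group_size=5, rng=random):
    """Return the best of a small random group of candidate males.

    `group_size` is a hypothetical parameter; the text only says "several".
    """
    # Every individual other than the female is a potential male.
    candidates = [ind for ind in population if ind is not female]
    group = rng.sample(candidates, min(group_size, len(candidates)))
    return max(group, key=performance)
```

Keeping `group_size` small relative to the population is what prevents the star performer from fathering most of the offspring.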




Furthermore,  the  selection  of  a mate  does  not  have  to  use  a single  performance 
measure.  If  more  than  one  measure  is  used,  then  the  genetic  algorithms  would 
go  beyond  a maximization  algorithm  since  maximization  algorithms  assume  a real 
valued  function  on  the  parameter  space. 

Mutation 

Mutation,  as  it  was  already  pointed  out,  is  the  only  source  of  genetic  variation 
that  leads  to  brand  new  genetic  material.  Therefore,  it  is  imperative  that  mutation 
be  present  at  all  stages  of  evolution  in  order  to  make  sure  that  the  algorithm 
converges  to  a global  maximum.  We  do  not  know  in  advance  how  many  iterations 
we  need,  nor  do  we  know  whether  a solution  is  the  global  maximum.  However,  if 
the  mutation  rate  is  zero  the  algorithm  will  not  be  able  to  find  the  best  solution. 

The  value  of  the  mutation  rate,  Rm,  has  a profound  effect  on  the  convergence  of 
the  algorithm.  A high  value  is  very  beneficial  at  the  initial  stages  of  evolution  since 
it increases the speed of discovering a few candidate solutions. On the other hand, at
later  stages  of  evolution  a high  value  of  Rm  has  a negative  impact  on  convergence 
since  most  mutations  are  damaging  and  thus  the  existing  good  genetic  information 
would  be  destroyed.  Therefore,  we  need  some  mechanism  to  change  the  mutation 
rate  dynamically. 

Many implementations of genetic algorithms use a fixed mutation rate, usually
in the range 0.001 < Rm < 0.03. Such an approach, however, has an extremely
negative impact on the convergence. The value for Rm depends on both the problem
being solved and the current stage of evolution. Trying to use a fixed value is
inadequate and completely arbitrary.




In our implementation of genetic algorithms, the mutation rate is a gene. As
a result, each individual has a different mutation rate which is used during reproduction.
This gene is not part of the performance measure M(P). Therefore, the
performance of an individual does not depend on the value of its mutation rate.
However,  the  distribution  of  mutation  rates  in  a population  changes  over  time  and 
adapts  itself  to  the  current  needs  of  the  population.  This  process  of  adaptation 
proceeds  as  follows: 

1.  When  the  population  is  in  its  initial  stages  of  evolution,  the  genes  of  its 
members  are  initially  completely  random.  This  implies  that  individuals  have 
completely  random  mutation  rates  - some  have  very  low  rates,  some  have 
rates  approaching  1.  During  reproduction,  the  individuals  with  low  mutation 
rates produce offspring that are not much different from the two parents, i.e.
the offspring are not likely to have a genome that is better than that of the
parents. In contrast, individuals with high values of their mutation rate have
offspring that are significantly different from the parents. This increases the
chance of discovering a genome that has a higher performance measure M(P).
As a consequence, the offspring of parents with high Rm are more likely to
be better performers than offspring of parents with low Rm. This implies a
very  efficient  use  of  CPU  resources  since  at  this  stage  the  genetic  algorithm, 
being  dominated  by  mutation,  acts  as  a random  algorithm  trying  to  discover 
the  overall  regions  in  the  parameter  space  P that  are  close  to  some  maxima. 

2.  When  a group  of  individuals  clusters  around  a region  in  the  P space  that 
is  “promising”,  crossover  (or  an  embedded  local  algorithm)  can  refine  the 
solution  much  faster  than  mutation.  As  a consequence,  individuals  in  this 




group  with  a lower  mutation  rate  are  more  likely  to  produce  offspring  with 
better  performance  since  a high  level  of  mutation  is  more  likely  to  damage 
an  existing  solution  than  to  improve  it.  As  a result,  the  mutation  rates 
of  individuals  will  start  decreasing  and  crossover  will  start  playing  a more 
prominent  role.  The  solution  at  this  stage  is  being  refined. 

3.  During  the  last  stages  of  finding  a local  maximum  the  mutation  rates  decrease 
to  their  minimum  possible  value,  since  only  crossover  (and  embedded  local 
algorithms,  if  any)  can  get  to  the  local  maximum.  If  the  minimum  mutation 
rate  is  0,  there  is  a high  probability  that  eventually  all  individuals  will  have 
Rm  = 0.  That  would  imply  that  no  further  improvement  in  the  population 
is  possible.  As  discussed  earlier,  this  should  be  avoided  except  possibly  for 
very simple problems. For that reason, we use a minimum mutation rate
Rm^min > 0. The actual value is not very critical^^. A minimum rate of
about Rm^min = 0.001 works well in both simple and complicated maximization
problems.


The  3 stages  described  above  are  only  meant  to  illustrate  3 limiting  cases.  In  fact, 
there  is  a continuum  of  stages.  The  mutation  rate  distribution  in  a population 
changes continuously. The algorithm adapts itself to the current needs for globality
and locality. For that reason, in this implementation of genetic algorithms

mutation  should  not  be  viewed  as  just  another  term  for  a random  algorithm.  Since 

^^Despite the fact that the minimum value of Rm^min is not too critical, it still implies a degree
of arbitrariness. It would be better not to have to select a minimum value at all. One way that
can  be  achieved  is  to  model  mutation  that  originates  with  the  environment.  That  would  be 
analogous  to  (for  example)  radiation  in  a natural  environment.  In  that  case  it  would  not  matter 
if  the  internal  mutation  rate  is  zero,  since  there  would  always  be  a finite  probability  for  mutation 
due  to  the  environment. 




mutation is part of the genome, it is a dynamical variable of the population
dynamics. Thus one can view mutation as making the genetic algorithm an adaptive
global algorithm without arbitrary parameters that control its behavior. Unlike
other “adaptive” global algorithms (such as simulated annealing, for example),
there is no assumption about the level of “noise” one should use. Due to the
significance of the impact that mutation has on the convergence of genetic algorithms,
the representation of mutation as a gene is perhaps the single most important
factor in the construction of a practical genetic algorithm.
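
To make the idea of a mutation-rate gene concrete, the sketch below stores the rate in the leading bits of a bit-string genome. Several details are assumptions rather than the implementation used in this work: the 8-bit width of the rate gene, the linear decoding, and the choice to mutate the child with the reproducing parent's rate. Only the lower bound of 0.001 is taken from the text.

```python
import random

RATE_BITS = 8    # assumed width of the mutation-rate gene
RM_MIN = 0.001   # minimum mutation rate quoted in the text


def decode_mutation_rate(genome):
    """Map the leading RATE_BITS bits to a rate in [RM_MIN, 1]."""
    raw = int("".join(map(str, genome[:RATE_BITS])), 2) / (2**RATE_BITS - 1)
    return max(RM_MIN, raw)


def mutate(child, parent, rng=random):
    """Flip each bit of the child with the parent's own mutation rate.

    The rate gene is mutated like any other bit, so the distribution of
    rates in the population can drift over time.
    """
    rate = decode_mutation_rate(parent)
    return [bit ^ 1 if rng.random() < rate else bit for bit in child]
```

Because the rate gene is excluded from the performance measure, selection acts on it only indirectly, through the quality of the offspring it produces.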

Crossover 

As  shown  above,  the  adaptive  nature  of  mutation  allows  the  genetic  algorithm 
to  change  its  degree  of  “globality”  dynamically.  This  would  be  useless,  however, 
unless  there  is  a mechanism  that  allows  a “local”  refinement  of  an  existing  set  of 
solutions.  Crossover  is  such  a universal  mechanism  that  exhibits  local  properties. 
We  should  emphasize,  however,  that  crossover  does  not  assume  the  existence  of 
any  topology  on  the  parameter  space  P or  any  subset  of  it.  For  that  reason, 
when  we  say  that  crossover  has  “local”  features,  what  we  mean  is  that  if  there  is 
a suitable  topology  on  P that  defines  the  notion  of  closeness,  then  crossover  will 
have  the  effect  of  local  convergence.  To  illustrate  that  fact,  let  us  consider  the 
following  example:  Suppose  that  the  parameter  space  P is  the  set  of  real  numbers 
in the interval [0, 1]. Further, suppose that we represent each real number with
one byte. Then there is a total of 2^8 = 256 solutions. During reproduction, each
of the parents represents one of these 256 solutions. The offspring, however, will
be a combination of the 2 parental solutions. In the case of single point crossover,




there  are  exactly  16  possible  combinations  for  the  child  that  are  induced  by  the 
two  parents.  In  general,  if  the  genes  representing  points  in  P are  N bits  long, 
then there are a total of 2^N solutions, but there are only 2N combinations due to
crossover  of  two  parental  genomes.  This  is  an  immense  reduction  of  combinations 
especially  for  high  values  of  N which  is  the  typical  case. 
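
The counting argument can be checked by direct enumeration. The sketch below is an illustration only: the function names are hypothetical, and the decoding of a byte as k/256 is an assumption chosen because it reproduces the value 0.1992 for the bit string 00110011. Since some cut points can yield identical children, 2N is an upper bound on the number of distinct children.

```python
def single_point_children(mother, father):
    """All distinct children of single-point crossover between two bit strings."""
    n = len(mother)
    children = set()
    for cut in range(n):  # cut = number of leading bits taken from the first parent
        children.add(mother[:cut] + father[cut:])
        children.add(father[:cut] + mother[cut:])
    return children


def decode(bits):
    """Interpret a bit string as a point in [0, 1) (assumed encoding: k / 2^N)."""
    return int(bits, 2) / 2 ** len(bits)


children = single_point_children("00110011", "11001101")
print(len(children), "distinct children out of", 2 ** 8, "possible solutions")
```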

We now consider two cases of crossover - parents that correspond to different
solutions and parents that are close to the same solution. In the first case,
one of the parents, let us say the mother, corresponds to the first maximum in
Fig. 5.9 and the father corresponds to the third maximum. If Pmother = 0.1992
and Pfather = 0.8001, then their binary representations would be 00110011 and
11001101 respectively. Suppose that the crossover point is as shown in Fig. 5.11.
Then the child will have its first 6 bits from one of the parents and, therefore, it
would be close to the solution of that parent. However, the lowest 2 bits are inherited
from the parent that corresponds to a completely different solution. Since
the low bits of two different solutions cannot have anything in common, the effect
is random with respect to the low bits. Because of that, crossover between very
different individuals is still beneficial, since it attempts the refinement of a solution
on the basis of a small scale noise. The scale, of course, depends on where
the crossover point is. The closer to the higher bits, the larger the scale, and
vice versa. Regardless of where the crossover point is, however, the essential genetic
information is preserved since the child always resembles one of the parents more
than the other.

If  the  two  parents  are  clustered  around  the  same  solution,  then  crossover  has 
a different effect. For example, let the mother still have Pmother = 0.1992. Let




[Figure 5.11: bit strings labeled Mother, Father, and Child.]

Figure 5.11: Crossover of two parental genes that correspond to two completely
different solutions. The mother’s gene represents the point Pmother = 0.1992 and
the father’s gene represents the point Pfather = 0.8001. The child is very close to
the parent from which it inherited the high 6 bits. Since the low 2 bits correspond
to a completely different solution, they act as noise on a small scale.

the father gene now be Pfather = 0.2031, i.e. both parents are close to the first
maximum.  Their  corresponding  binary  representations  are  00110011  and  00110100 
respectively.  Assuming  the  same  crossover  point  as  in  the  previous  example,  the 
result  of  crossover  is  as  shown  in  Fig.  5.12.  Since  both  parents  have  the  same  5 
highest  bits,  crossover  has  no  effect  on  the  first  5 bits.  In  other  words,  the  child 
will  definitely  inherit  this  valuable  information  and  will  not  deviate  too  much  from 
the  first  maximum.  Its  low  3 bits,  however,  will  be  a combination  of  the  2 parents. 
This  is  no  longer  a random  factor,  but  instead  it  acts  as  a small  scale  enumeration 
algorithm.  The  net  result  is  refinement  of  the  solution  since  the  child’s  performance 
can  exceed  that  of  either  parent  in  which  case  future  generations  will  incorporate 
that  information  in  the  genetic  pool  of  the  population. 
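
Both cases can be reproduced with the bit strings quoted in the text. The sketch below assumes a crossover point after the sixth bit (as described for Fig. 5.11) and the k/256 decoding; the helper names are hypothetical.

```python
def crossover(high_parent, low_parent, cut):
    """Child takes the first `cut` bits from one parent and the rest from the other."""
    return high_parent[:cut] + low_parent[cut:]


def decode(bits):
    """Assumed encoding: an N-bit string k represents the point k / 2^N."""
    return int(bits, 2) / 2 ** len(bits)


mother = "00110011"          # 51/256, approximately 0.1992
distant_father = "11001101"  # the father of the first example
close_father = "00110100"    # 52/256, approximately 0.2031

# Case 1: distant parents. The child stays near the mother; the low bits act as noise.
child_far = crossover(mother, distant_father, cut=6)

# Case 2: close parents. The shared high bits survive; only the low bits recombine.
child_near = crossover(close_father, mother, cut=6)

print(child_far, decode(child_far))
print(child_near, decode(child_near))
```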

The  crossover  rate,  Rc,  determines  whether  crossover  should  take  place  or  not. 
A smaller value for Rc results in many individuals that are exactly the same and,




[Figure 5.12: bit strings labeled Mother, Father, and Child.]


Figure  5.12:  Crossover  of  two  parental  genes  that  correspond  to  two  very  similar 
solutions.  The  mother  gene  represents  the  point  Pmother  = 0.1992  and  the  father 
gene  represents  the  point  Pfather  = 0.2031.  The  child  is  very  close  to  both  parents, 
but  its  low  3 bits  form  a new  combination  which,  if  better,  will  propagate  through 
future  generations. 


as  a result,  genetic  diversity  is  diminished.  Higher  values  of  Rc  result  in  a more 
rapid  genetic  recombination  with  all  the  positive  consequences  that  follow  from  it. 
In  genetic  algorithms  that  use  a fixed  mutation  rate,  the  crossover  rate  should,  in 
general,  not  be  1,  since  the  algorithm  resembles  an  enumeration  algorithm  at  the 
early stages of evolution. As was the case with the mutation rate, the crossover
rate also can benefit from becoming a dynamical variable of the population
dynamics. For that reason, in our version of genetic algorithms the crossover rate is
also  a gene.  The  distribution  of  crossover  rates  changes  over  time  adapting  itself 
to  the  current  “needs”  of  the  population.  Unlike  the  mutation  rate,  however,  the 
convergence  of  the  algorithm  is  not  very  sensitive  to  the  value  of  Rc  assuming  that 
the  mutation  rate  is  a gene. 




5.4.4  Gene  Linkage 

Gene  linkage  refers  to  the  physical  association  of  two  genes.  If  two  genes  reside 
on  separate  chromosomes,  then  they  are  independent  of  each  other,  or  unlinked. 
However,  if  two  genes  reside  on  the  same  chromosome,  then  the  relative  distance 
between  them  is  a measure  of  how  linked  these  two  genes  are.  The  reason  for  this 
is  crossover.  In  particular,  when  a child’s  DNA  is  formed,  crossover  is  combining 
entire  fragments  of  DNA  from  both  parents.  Therefore,  the  closer  two  genes  are, 
the  more  likely  it  is  that  they  will  have  originated  from  the  same  parent. 
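
The dependence of linkage on distance can be quantified for the simplest case. The sketch below assumes single-point crossover with the cut chosen uniformly among the N - 1 interior positions of an N-position genome; under that assumption, two positions are separated exactly when the cut falls between them. The function name is hypothetical.

```python
from fractions import Fraction


def separation_probability(pos_a, pos_b, genome_length):
    """Probability that one uniformly chosen interior cut falls between two positions."""
    distance = abs(pos_a - pos_b)
    return Fraction(distance, genome_length - 1)


# Adjacent positions are rarely separated; the two ends always are.
print(separation_probability(0, 1, 8), separation_probability(0, 7, 8))
```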

A particular  case  of  linkage  is  an  operon.  An  operon  is  a sequence  of  genes 
that  are  so  strongly  related  to  each  other,  that  it  is  meaningless  to  express  only 
some  of  them.  In  other  words,  their  combined  action  is  already  fine-tuned  and 
represented  by  an  operon.  If  such  genes  were  separated  on  the  DNA,  then  there 
would  be  a much  higher  probability  that  some  of  the  genes  would  come  from  one 
parent  and  others  from  the  other.  Then  the  likelihood  for  disturbing  the  correct 
action  of  the  genes  on  the  operon  would  have  been  very  high.  In  living  organisms 
operons  are  extremely  common.  In  genetic  algorithms  they  are  also  very  common. 
For  example,  in  the  next  section  we  will  represent  linear  cuts  as  genes.  Our  goal 
will  be  to  find  a set  of  left  and  right  cuts  on  some  input  variables.  An  operon  in 
that  case  is  the  combination  of  left  and  right  cut  since  one  does  not  make  sense 
without  the  other. 

Operons,  however,  are  not  the  only  examples  of  genes  that  are  related  to  each 
other.  Depending  on  what  the  parameter  space  P represents,  we  may  know  in 
advance  that  some  genes  are  more  related  to  each  other  than  others.  If  that  is  the 
case,  we  must  place  them  next  to  each  other.  Failure  to  do  so  will  slow  down  the 




convergence  of  the  genetic  algorithms  since  the  population  is  lacking  information 
that  was  known  to  us  in  advance  and  we  did  not  utilize  it. 

It  is  worth  noting  that  our  model  of  evolution  differs  from  nature  in  this  respect. 
In  real  chromosomes,  the  location  of  genes  is  not  predetermined.  It  is  discovered 
dynamically.  In  other  words,  nature  discovers  not  only  which  genes  are  good,  but 
also  where  to  place  them  relative  to  each  other  in  order  to  minimize  the  disturbing 
effect from crossover on genes that are related. The two major structural ingredients
that allow genes to “migrate” on the DNA molecule are the mechanisms for
gene expression and the ability to transpose a DNA fragment from one place to
another. Both of them, especially gene expression, are difficult to model. Nevertheless,
even if our choice of gene layout is incorrect, we would still be able to find
good solutions. It will just take longer.

In  the  next  two  sections  we  will  give  specific  gene  maps  for  linear  cuts  and 
neural  nets. 

5.4.5  Age 

Each individual in our genetic algorithms can die in one of two ways: due to
natural selection or old age. The death rate, Rd, depends on the relative population
size as shown in Fig. 5.13. After an individual receives execution from the
environment and after it has optionally reproduced, it can die with a probability
determined by the distribution shown in Fig. 5.13. Since this probability is tested
during  each  simulation  year,  the  cumulative  probability  for  death  in  the  long  run 
is  1.  The  life  expectancy  itself  depends  on  the  number  of  deaths  due  to  natural 
selection,  and  on  the  current  population  size.  The  latter  affects  both  the  birth  rate 


[Figure 5.13: death probability (vertical axis, 0 to 1) plotted against relative population size (horizontal axis, 0 to 1).]

Figure  5.13:  Shows  the  death  rate  as  a function  of  the  relative  population  size. 
The  more  individuals  there  are,  the  more  likely  each  is  to  die.  This  is  another 
population  control  factor. 

and the death rate. The actual shapes of the probability curves for the birth and death
rates are not critical at all. However, assuming that deaths from natural selection
are not changing, these two curves define the balance between births and deaths.
There can be short term increases or decreases in the population, but they cannot
(even in principle) persist in the long run.
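
The balance point defined by the two curves can be located numerically. The sketch below ignores deaths from natural selection and assumes that the birth curve is decreasing, the death curve is increasing, and the two cross once. The linear curves used in the example are hypothetical, since the shapes in Figs. 5.10 and 5.13 are not specified in the text.

```python
def equilibrium_relative_size(birth, death, tolerance=1e-9):
    """Bisect for the relative size r in [0, 1] where birth(r) == death(r).

    Assumes birth is decreasing, death is increasing, and the curves cross once.
    """
    low, high = 0.0, 1.0
    while high - low > tolerance:
        mid = 0.5 * (low + high)
        if birth(mid) > death(mid):
            low = mid   # births dominate: the population still grows
        else:
            high = mid  # deaths dominate: the population shrinks
    return 0.5 * (low + high)


# Hypothetical linear curves chosen only for illustration.
r_star = equilibrium_relative_size(lambda r: 1.0 - r, lambda r: r)
print(r_star)
```

Below the balance point births outnumber deaths on average, and above it the reverse holds, which is why deviations from it cannot persist.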

There are genetic algorithms that do not utilize death from old age. Those algorithms
then have to rely on some fixed population control policy, as was described
earlier. In nature, however, this is never the case. Each species has an optimal life
expectancy that balances the need for change and the need to preserve the genetic
pool. Even the best performers in a population should eventually make room for
new individuals. Otherwise, the population is in danger of losing its genetic diversity,
since a few very old individuals (perhaps even with identical genomes) could
stagnate the population by effectively suppressing new genetic solutions. As much




as  none  of  us  wants  to  die,  in  the  long  run  this  is  necessary  for  the  preservation 
and  evolution  of  the  species. 

There is one final point about population control that should be mentioned.
Both death rate and birth rate utilize probability functions depending on the relative
population size. Since they implicitly determine the average life expectancy
(excluding the effects of natural selection that result in mass extinctions), we can
observe that there is no exact correspondence between nature and the model. In
nature, each species discovers its optimal life expectancy - trees and eagles live
longer than we do, dogs and cats (regrettably) do not. In the model, we are essentially
fixing the life expectancy. This is not necessarily good since different
maximization problems might benefit from different longevity of the individuals
representing them. However, there is one possible extension of the current version
of the genetic algorithms described here. If each individual has a small brain (a tiny
neural network) which is used to make the decision for reproduction and death,
and if the network weights are genes, then a population should be able to discover
its optimal life expectancy which should only weakly depend on the maximum
population size. In essence, this would constitute the modeling of instincts.

5.4.6  Directed  Genetic  Changes  (Embedded  Algorithms) 

The  final  factor  that  can  affect  the  performance  of  a genetic  algorithm  is  the 
existence  of  additional  (embedded)  algorithms  that  change  the  genes  representing 
points in the parameter space P. For example, suppose that the parameter space
P (or a subset of it) is R^n. Since R^n naturally supports a number of topologies,
we could utilize a local algorithm that is invoked every time an individual receives




execution  from  the  environment.  In  that  case,  we  would  be  changing  the  expression 
of  the  genes,  not  the  genes  themselves.  If  we  perform  a reverse  translation  and 
reverse  transcription  of  the  numbers,  then  any  improvement  that  might  have  been 
achieved  by  that  local  algorithm  would  be  inheritable. 

To illustrate embedded algorithms, let us construct an algorithm that we call
an implicit gradient descent algorithm. Let A ∈ R. Let us represent A as a gene.
Let the parameter space P = R^n. Then the implicit gradient descent algorithm
can be defined as follows:

1.  An  individual  receives  execution  from  the  environment. 

2. The individual randomly selects a gene corresponding to one of the n dimensions
of P. Let x ∈ R be the expression of that gene.

3. The individual changes x by either adding A to it or subtracting A from it, i.e.
x → x ± A. That means that the individual changes the point p ∈ P to the point
p' ∈ P with respect to the selected dimension.

4. The individual computes the new performance M'(p').

5. If M'(p') > M(p), then the individual performs reverse translation and transcription,
i.e., it updates the gene for the selected dimension so that it corresponds
to the new numerical value of x.
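
The five steps above can be sketched as a single function. This is a minimal illustration under assumptions: the point is held as a list of floats rather than as encoded genes, the step A is passed in as the argument `step` instead of being read from the genome, and the sign is chosen at random with equal probability. All names are hypothetical.

```python
import random


def implicit_gradient_step(point, step, performance, rng=random):
    """One embedded-algorithm step: try x -> x +/- step along one random dimension.

    Returns the improved point if performance increased, otherwise a copy of the original.
    """
    dimension = rng.randrange(len(point))
    sign = rng.choice((-1.0, 1.0))
    candidate = list(point)
    candidate[dimension] += sign * step
    if performance(candidate) > performance(point):
        return candidate  # accepted: the change would be written back to the gene
    return list(point)
```

A rejected trial leaves the individual unchanged, so the performance of an individual never decreases as a result of this step.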

The major advantage of such an algorithm is that we explicitly utilize
the fact that the space P is a topological space, which can, in certain circumstances,
improve the convergence to the best solution. One should be very careful,
however, when applying local embedded algorithms. They definitely improve the




convergence to a local maximum, but they will not necessarily improve the convergence
to a global maximum. The major reason for that is genetic diversity (or
the  lack  of  it).  In  the  absence  of  embedded  local  algorithms,  crossover  and  sexual 
selection  are  the  leading  factors  for  the  refinement  of  a solution.  Neither  of  them 
depends  strongly  on  any  single  individual.  In  contrast,  a single  individual  that 
employs  an  embedded  local  algorithm  can  improve  its  performance  too  fast.  As 
a consequence,  natural  selection  can  initiate  a mass  extinction,  since  it  is  more 
likely  that  most  other  individuals  are  inferior  and  that  the  average  performance 
is  slightly  below  the  maximum  performance.  The  mass  extinction  would  remove 
valuable  genetic  information  that  might  lead  to  a better  performance  than  the 
current  star  performer.  In  other  words,  a local  embedded  algorithm  might  be  “too 
fast”  and  condemn  the  population  to  a local  maximum. 

The  implicit  gradient  descent  algorithm  described  above  does  not  have  such 
profound  negative  effects  on  the  ultimate  convergence  to  the  global  maximum. 
The  reason  is  two-fold:  First,  only  one  of  the  dimensions  is  changed;  Second,  the 
change  A is  a gene  which  implies  an  adaptation  to  the  current  stage  of  evolution 
of  a population.  In  particular,  during  the  initial  stages  of  evolution,  mutation  is 
the  leading  factor  which  also  implies  that  the  value  of  the  gene  A will  be  very 
large  (50%  of  its  maximum  value).  Consequently,  it  is  not  likely  at  all  that  the 
implicit  gradient  descent  algorithm  will  have  any  noticeable  effect,  since  the  step 
A by  which  it  is  changing  the  values  is  too  large  and,  therefore,  it  is  a variation 
of  a large  scale  random  algorithm.  When  crossover  and  sexual  selection  start  to 
become  important,  the  A gene  still  cannot  affect  evolution  too  much,  since  it  acts 
on  a random  projection  of  the  space  P and  there  is  a high  probability  for  mismatch 




between  the  scale  that  A implies  and  the  current  scale  of  crossover.  If  A is  much 
bigger  than  the  current  scale  implied  by  crossover  for  a certain  dimension,  then  it 
will  not  contribute  to  an  improvement  of  a solution.  If  A is  too  small  compared 
to  the  scale  of  crossover,  then  the  improvement  is  insignificant  since  it  is  likely 
that  future  crossovers  will  ignore  it.  During  the  final  stages  of  improvement  of  a 
single  solution,  A becomes  the  leading  factor  with  respect  to  the  dimension  that 
can  most  benefit  from  it.  During  that  stage,  mutation  is  very  low  and  crossover 
only  works  on  a scale  smaller  or  comparable  to  that  of  A.  During  that  last  stage, 
A becomes the leading source of change and unless mutation is still present (at
any level), there would be a convergence to a (most likely) local maximum led by
the implicit gradient descent algorithm.

As  can  be  seen  from  the  above  discussion,  there  is  a loss  of  time  due  to  the 
implicit  gradient  descent  algorithm  in  the  first  two  stages  of  evolution  and  a gain 
in  time  during  the  last  stage  of  evolution  of  a single  solution.  Empirical  evidence 
suggests that if the parameter space P is large (say dim(P) > 100), then an
implicit  gradient  descent  algorithm  is  not  beneficial  in  the  long  run,  i.e.  it  forces 
a population  into  a local  maximum.  Empirical  evidence,  of  course,  cannot  be 
used  blindly  in  every  circumstance.  Nevertheless,  at  least  for  simpler  problems, 
an  embedded  local  algorithm  can  be  beneficial  if  one  is  very  cautious  about  its 
consequences. 

The  implicit  gradient  descent  algorithm  described  above  is  just  one  example 
of  an  embedded  algorithm.  One  could  venture  to  use  any  additional  algorithm  in 
conjunction with genetic algorithms. The major advantage is that any parameters
that this additional algorithm may require can be represented as genes and,




therefore,  be  converted  into  dynamical  variables  of  the  population  dynamics.  The 
disadvantage  is  that  one  can  no  longer  assume  a generic  convergence  to  the  global 
maximum.  In  each  specific  case,  one  would  have  to  investigate  the  convergence 
effects that an embedded algorithm has.

Embedded algorithms (if used) are the most significant deviation of the computer
model from nature. In genetics, there is the so called central dogma. It
stipulates that information flow is uni-directional - from DNA to RNA to proteins.
Since embedded algorithms affect numbers (which are the analog of proteins),
reverse translation and transcription would violate the central dogma. However,
it  is  not  yet  clear  to  what  extent  the  central  dogma  is  valid.  There  are  at  least  2 
definite  violations  and  one  possible  violation.  They  are  as  follows: 

1.  Retroviruses  definitely  utilize  reverse  transcription. 

2.  It  is  almost  certain  that  life  has  initially  chosen  RNA  as  genetic  storage.  This 
implies  an  extended  period  of  reverse  transcription  on  a large  scale. 

3.  There  is  some  preliminary  evidence  of  directed  genetic  change  (hence  the  title 
of  this  subsection).  If  true,  this  would  give  a more  solid  basis  for  embedded 
algorithms. 

In  our  opinion,  the  issue  of  how  much  or  little  the  central  dogma  is  violated  in 
nature  is  of  utmost  importance  for  the  utilization  of  embedded  algorithms.  After 
all,  the  basic  premise  for  the  construction  of  the  version  of  genetic  algorithms 
described  here  was  that  nature  has  had  a long  time  to  discover  what  works  and 
what  does  not.  Short  of  a mathematical  proof  that  certain  embedded  algorithms 
are  beneficial,  one  should  be  careful  to  not  overuse  them.  Even  if  directed  genetic 




change  is  proven  some  day,  it  would  still  not  necessarily  be  a common  factor  for 
evolution. 


5.5  Application  to  Finding  Optimal  Linear  Cuts 

As  already  discussed  in  Chapter  3,  finding  optimal  multi-dimensional  linear 
cuts  can  be  a challenging  problem  if  using  local  algorithms,  due  to  the  existence 
of  local  minima.  In  this  section,  we  will  apply  genetic  algorithms  to  this  problem. 

Let $X \subset \mathbb{R}^N$ be the space of variables describing a collider event,
where $N = \dim(X)$ is the number of variables. The individual variables, $x_i$, are
projections of a point $x \in X$. For example, in Chapters 7 and 8 the space $X$ will
consist of the first six modified Fox-Wolfram coefficients, $H_l$, and the objective
will be to find the best six-dimensional hyper-cube in this space which contains most
of the signal events and only a few of the background events.

For each variable $x_i$, we introduce a left cut $L_i$ and a right cut $R_i$. The space
of cuts, therefore, is a 2N dimensional space. For an event to survive the
multi-dimensional cuts, we require that

$$L_i < x_i < R_i \qquad (5.5)$$

for all $i = 1, \ldots, N$. In other words, the event must lie within the hyper-cube
defined by the N pairs of left and right cuts.
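
The survival condition of Eq. 5.5 can be sketched in a few lines of C++. This is only an illustration, not the dissertation's code: the function name `survivesCuts` and the use of `std::vector<double>` for an event and for the cuts are assumptions made here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original work): returns true when an
// event x lies strictly inside the box defined by the cuts, i.e. when
// L_i < x_i < R_i holds for every variable i (Eq. 5.5).
bool survivesCuts(const std::vector<double>& x,
                  const std::vector<double>& left,
                  const std::vector<double>& right) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (!(left[i] < x[i] && x[i] < right[i])) {
            return false;  // a single failed variable rejects the event
        }
    }
    return true;
}
```

An event is rejected as soon as one variable falls outside its pair of cuts, which is why the number of cuts to tune grows as 2N.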

At this point, we must choose a performance measure. For example, we could
use the average quadratic error of misclassification. However, this measure would
not take into account that signal misclassification should be “penalized” more
than background misclassification. In addition, as discussed in Chapter 4,
we cannot define the expectation value operator required to compute the average
since we do not know the joint probability distribution between the input space,
$X$, and the output space, $Y = \{0, 1\}$, which specifies whether an event survived the
cuts ($y = 1$) or not ($y = 0$).

For that reason, we use the statistical significance measure, $R_f$, given by

$$R_f = \frac{N_{sig}}{\sqrt{N_{bak}}} \qquad (5.6)$$

where $N_{sig}$ is the number of signal events that survived the multi-dimensional cuts
and $N_{bak}$ is the number of background events that survived these cuts.

The number of signal events surviving the cuts can be expressed as

$$N_{sig} = \sum_{s=1}^{S} \prod_{i=1}^{N} \theta\big((x_i)_s - L_i\big)\,\theta\big(R_i - (x_i)_s\big) \qquad (5.7)$$

where S is the total number of signal events in the data sample and the sum goes
over signal events only. Similarly, the number of background events surviving the
cuts can be written as

$$N_{bak} = \sum_{b=1}^{B} \prod_{i=1}^{N} \theta\big((x_i)_b - L_i\big)\,\theta\big(R_i - (x_i)_b\big) \qquad (5.8)$$

where B is the total number of background events in the sample and the sum is
over background events only.

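Using the last two equations, we can rewrite $R_f$ as

$$R_f(L, R) = \frac{\sum_{s=1}^{S} \prod_{i=1}^{N} \theta\big((x_i)_s - L_i\big)\,\theta\big(R_i - (x_i)_s\big)}{\sqrt{\sum_{b=1}^{B} \prod_{i=1}^{N} \theta\big((x_i)_b - L_i\big)\,\theta\big(R_i - (x_i)_b\big)}} \qquad (5.9)$$

where the two vectors $L, R \in X$ represent the parameter space of the real valued
performance measure $R_f$. The optimal solution, therefore, is given by the optimal
cuts $L^{\circ}$ and $R^{\circ}$ which satisfy the relation

$$R_f(L^{\circ}, R^{\circ}) \geq R_f(L, R) \qquad (5.10)$$

for all $L, R \in X$, i.e. we have to find the maximum of the function $R_f(L, R)$ over
the space of all cuts.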
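
As an illustration of how the performance measure can be evaluated for a given set of cuts, the following C++ sketch counts the surviving events and forms the ratio of the signal count to the square root of the background count. The names `Event`, `countSurvivors` and `significance` are hypothetical, and returning zero when no background event survives is an assumption made here to avoid a division by zero; the text does not say how that case is handled.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Event = std::vector<double>;  // one value per event variable x_i

// Counts the events with L_i < x_i < R_i for every i. For such an event the
// product of step functions in the counting sums equals 1, otherwise 0.
std::size_t countSurvivors(const std::vector<Event>& events,
                           const std::vector<double>& left,
                           const std::vector<double>& right) {
    std::size_t n = 0;
    for (const Event& x : events) {
        bool inside = true;
        for (std::size_t i = 0; i < x.size() && inside; ++i) {
            inside = left[i] < x[i] && x[i] < right[i];
        }
        if (inside) ++n;
    }
    return n;
}

// Statistical significance N_sig / sqrt(N_bak) for the cuts (left, right).
// Assumption: the value 0 is returned when no background event survives.
double significance(const std::vector<Event>& signal,
                    const std::vector<Event>& background,
                    const std::vector<double>& left,
                    const std::vector<double>& right) {
    const double nSig = static_cast<double>(countSurvivors(signal, left, right));
    const double nBak = static_cast<double>(countSurvivors(background, left, right));
    return nBak > 0.0 ? nSig / std::sqrt(nBak) : 0.0;
}
```

Because the event counts change only in integer steps as a cut moves, this function is piecewise constant in the cuts, which is consistent with the difficulty of optimizing it with gradient-based local algorithms.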

To  achieve  this  task  with  genetic  algorithms,  we  define  new  “species”  whose 
performance  measure  is  given  by  Eq.  5.9.  These  new  individuals  are  represented 
by the class LinearCuts derived from class Individual as shown in Fig. 5.14. Since
the behavior of individuals is already defined and implemented in class Individual,
class LinearCuts only needs to implement the performance measure $R_f$ and to
specify the genome of LinearCuts.

Figure 5.14: C++ class diagram that shows the definition of the new species
LinearCuts. They are derived from the base class Individual and therefore they
inherit the code specifying their behavior from the base class.

Since the 2N cuts span the parameter space over which we are maximizing,
they must be genes. In addition, as discussed earlier, the mutation rate, $R_m$, and
the crossover rate, $R_c$, must also be genes in order to assure good convergence of
the algorithm. We will also add the implicit gradient descent algorithm described
in Section 5.4.6, which requires the single real valued gene A.

Without loss of generality, we can assume that the variables $x_i$ describing the
events are in the interval [0, 1]. As a consequence, $L_i, R_i \in [0, 1]$ as well. In
addition, the gene A can also be restricted to [0, 1].

In this particular realization of linear cuts, we also let the mutation and crossover
rates be in the range [0, 1]. It is important to realize, however, that when
these  rates  go  to  zero,  the  population  will  be  unable  to  find  new  solutions  and, 
consequently,  evolution  should  be  suspended.  This  would  be  a serious  problem  for 
high-dimensional  parameter  spaces  since  there  would  be  many  more  local  minima. 
One  such  case  is  the  training  of  neural  networks  using  genetic  algorithms  which  is 
described  in  the  next  section.  In  that  case,  the  dimension  of  the  parameter  space  is 
usually  100  or  more  and,  for  that  reason,  we  will  not  allow  mutation  and  crossover 
to  go  down  to  zero. 

To summarize, the genome of LinearCuts individuals consists of the following
genes:

• The mutation rate, $R_m \in [0, 1]$.

• The crossover rate, $R_c \in [0, 1]$.

• The implicit gradient descent parameter, $A \in [0, 1]$.

• The set of left and right cuts for each variable, $L_i, R_i \in [0, 1]$.

As discussed earlier, gene linkage can be an important factor in the convergence of
the algorithm. In particular, we can see that each pair of left and right cuts forms
an operon since it is meaningless to have one without the other for a specific
event variable. For that reason, it makes sense to order the genes corresponding
to the cuts as $L_1, R_1, \ldots, L_N, R_N$. We chose to use a single chromosome and place
the genes as shown in Fig. 5.15.


[ $R_m$ | $R_c$ | A | $L_1$ | $R_1$ | ... | $L_N$ | $R_N$ ]

Figure 5.15: Genetic map of LinearCuts individuals. The first two genes are
implicit parameters of the genetic evolution. The third gene, A, is a parameter of
the implicit gradient descent algorithm. The first three genes are not used in the
performance measure $R_f$. The next 2N genes are the left and right cut pairs for
each input variable $x_i$. Their order reflects the fact that each pair of left and right
cuts for the same variable forms an operon and, therefore, they should be next to
each other.

To illustrate the synergistic effect of crossover and proper gene linkage, let us
consider a single pair of left and right cuts as shown in Fig. 5.16. In this example,
we are assuming that each gene is represented by one byte^14. Since each of the
left, $L_i$, and right, $R_i$, cuts lies between zero and one, we can multiply it by 255
and represent it as a single byte (eight bits) within the computer. For example,
the gene corresponding to the left cut $L_i = 0.251$ is represented in the computer
as follows:

$$L_i = 0.251 \;\to\; [01000000]. \qquad (5.11)$$

^14 In our calculations, we use two bytes to represent each real number, but for illustration
purposes it is simpler to consider just one byte.


[Figure 5.16: diagram of the genes $L_i$ and $R_i$ for Parent 1, Parent 2, and the Child]

Figure 5.16: Crossover of two parental genes. A split position is chosen at random within the
genes of the parents. The child receives all the bits to the left from one parent and all the bits to
the right from the other parent. The figure shows the two genes $L_i = 0.251$ and $R_i = 0.5059$ for
one parent and $L_i = 0.0$ and $R_i = 1.0$ for the other parent. For this crossover, the possible
children receive an unchanged $L_i$ (one gets $L_i = 0.251$ and the other gets $L_i = 0.0$) and both
get a modified $R_i$ that is a combination of the parental bits ($R_i = 0.5137$ for the one child and
$R_i = 0.9922$ for the other).


Each individual has a set of 2N genes corresponding to the N pairs of left and right
cuts. These 2N genes are represented in the computer as a string of 2N bytes. For
example, all left cuts of zero and all right cuts of one would be represented as

$$L_1, R_1, \ldots, L_N, R_N \;\to\; [00000000][11111111] \ldots [00000000][11111111]. \qquad (5.12)$$


Since the two cuts of each pair are next to each other on the chromosome,
there is a high probability that a child will inherit the pair unchanged from one of
the parents (when the crossover point is outside the pair of cuts). Even in the case
of a crossover point within a pair of cuts, only one of the cuts is modified, which
has the effect of “refinement” of the affected cut while keeping the other fixed. The
situation  would  be  very  different  if  left  and  right  cuts  for  the  same  variable  were  far 
away from each other on the chromosome (or worse yet, on different chromosomes).
In that case, it would be very likely that each child inherits the left cut from one of
the  parents  and  the  right  cut  from  the  other.  As  a consequence,  the  convergence  of 
the  algorithm  would  be  diminished  since  the  two  parents  may  represent  completely 
different  solutions  that,  when  mixed,  produce  an  inferior  child.  In  other  words, 
improper  gene  linkage  decreases  the  positive  local  effect  of  crossover. 
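
The crossover of a single pair of cuts described in the caption of Fig. 5.16 can be reproduced with a short C++ sketch. It assumes, for illustration only, that the two one-byte genes are packed into 16 bits with the left cut in the high byte; the type `Chromosome` and the function `crossover` are hypothetical names. The split after the 14th bit used in the usage note is inferred from the values quoted in the caption and is not stated in the text.

```cpp
#include <cstdint>
#include <utility>

// Two one-byte genes (left cut, right cut) packed into 16 bits, with the
// left cut in the high byte. This packing is an assumption of the sketch.
using Chromosome = std::uint16_t;

// Single-point crossover. Each child takes the bits to the left of the
// split from one parent and the remaining bits from the other parent.
// `split` counts bits from the left and must satisfy 0 < split < 16.
std::pair<Chromosome, Chromosome> crossover(Chromosome a, Chromosome b,
                                            unsigned split) {
    const unsigned rightMask = (1u << (16u - split)) - 1u;
    const unsigned leftMask = 0xFFFFu & ~rightMask;
    const Chromosome child1 =
        static_cast<Chromosome>((a & leftMask) | (b & rightMask));
    const Chromosome child2 =
        static_cast<Chromosome>((b & leftMask) | (a & rightMask));
    return std::make_pair(child1, child2);
}
```

For the parents of Fig. 5.16, the packed chromosomes are 0x4081 (bytes 64 and 129, i.e. 0.251 and 0.5059) and 0x00FF (bytes 0 and 255, i.e. 0.0 and 1.0). Calling `crossover(0x4081, 0x00FF, 14)` yields 0x4083 and 0x00FD: the left cuts are inherited unchanged, while the right cuts become 131/255 ≈ 0.5137 and 253/255 ≈ 0.9922, in agreement with the caption.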

In  this  particular  implementation  of  LinearCuts,  the  implicit  gradient  descent 
algorithm  is  applied  immediately  after  crossover  and  mutation.  In  other  words, 
when  a child  is  born,  one  of  the  2N  cuts  is  chosen  at  random  and  the  value  of  A 
is  either  added  to  or  subtracted  from  that  cut,  i.e. 

$$L_i \to L_i \pm A \quad \text{or} \quad R_i \to R_i \pm A. \qquad (5.13)$$

Of  course,  this  change  must  be  followed  by  a reverse  translation  and  transcription 
in  order  for  it  to  become  a heritable  trait. 
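
A minimal sketch of this step is given below. It assumes that the cuts are stored as a flat vector of real numbers in the order left, right, left, right, and so on, that the sign is chosen with equal probability, and that the result is clamped to [0, 1]; none of these details are specified in the text, and the function name is hypothetical. The write-back into the bit representation (the reverse translation and transcription) is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Applies the update of Eq. 5.13 to one randomly chosen cut.
// `step` is the value of the implicit gradient descent gene.
// Assumptions of this sketch: equal probability for + and -, and
// clamping of the result to the allowed range [0, 1].
void implicitGradientStep(std::vector<double>& cuts, double step,
                          std::mt19937& rng) {
    if (cuts.empty()) return;
    std::uniform_int_distribution<std::size_t> pick(0, cuts.size() - 1);
    std::bernoulli_distribution plus(0.5);
    const std::size_t k = pick(rng);
    const double moved = cuts[k] + (plus(rng) ? step : -step);
    cuts[k] = std::min(1.0, std::max(0.0, moved));
}
```

Only one of the 2N cuts changes per child, so the step acts as a local refinement of an existing solution rather than as a new random search.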

As discussed earlier, the gene A adapts itself to the current needs of the population.
In particular, its effect is most important when the mutation and crossover
rates  are  close  to  zero.  The  existing  solutions  in  the  population  are  then  refined  in 
a local  fashion.  At  this  point  of  evolution,  if  A is  too  large  for  an  individual,  his 
or  her  offspring  will  be  inferior  and,  therefore,  doomed  to  extinction.  On  the  other 
hand,  individuals  with  a “reasonable”  A will  produce  offspring  that  are  better 
performers  since  the  cuts  are  changed  just  by  the  “right”  amount.  Eventually,  the 
gene  A goes  to  zero  which  indicates  that  no  further  improvement  is  possible  and 
evolution  should  be  stopped. 




5.6  Application  to  Training  of  Neural  Networks 


In  Chapter  4,  we  presented  some  of  the  difficulties  associated  with  the  training 
of  neural  networks.  In  particular,  we  showed  that  neural  networks  that  use  the 
statistical  significance  measure,  Rf,  cannot  be  trained  with  local  algorithms.  In 
this  section,  we  apply  genetic  algorithms  to  the  training  of  neural  networks  using 
that  measure.  In  a sense,  training  of  neural  networks  using  genetic  algorithms  can 
be  viewed  as  the  most  generic  approach  to  signal  enhancement.  It  combines  the 
ostensive  nature  of  neural  networks  with  the  robustness  and  generality  of  genetic 
algorithms  as  a means  of  training.  Neither  neural  networks  nor  this  version  of 
genetic  algorithms  depends  on  arbitrary  parameters  that  can  decrease  the  potential 
for  signal  enhancement. 

As was the case with multi-dimensional linear cuts, we represent a neural network
as an individual of a new species as shown in Fig. 5.17. The class NeuralNetwork
inherits the behavior implemented in class Individual. We only need
to implement a neural network using the statistical significance measure, $R_f$, as
defined in Eq. 4.19 and decide on the genetic map of this species.

Figure 5.17: C++ class diagram that shows the definition of the new species
NeuralNetwork. They are derived from the base class Individual and, therefore,
they inherit the code specifying their behavior from the base class.

We implement a layered feed-forward neural network in class NeuralNetwork^15.
All neurons use the sigmoidal activation function, $S(x)$, with temperature $\tau = 1$^16.
The output layer has one neuron only so that we can implement the measure $R_f$.

As in the case with linear cuts, the mutation rate, $R_m$, and the crossover rate,
$R_c$, are genes. We do not, however, use the implicit gradient descent algorithm
or any other embedded algorithm. In addition, the mutation rate is restricted
to the range $R_m \in [0.001, 1]$ and the crossover rate to the range $R_c \in [0.9, 1]$.
As a consequence, crossover occurs almost always and there is always some finite
probability for mutation. For that reason, evolution should never be stopped since
there is always a finite probability for the population to improve.

In addition to these two genes, all synaptic weights and thresholds are also
genes. They are restricted to the range $w_i, t_j \in [-10, 10]$ since the sigmoidal
function is very close to its asymptotic values at the boundaries of this range.
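
The statement about the asymptotic values can be checked numerically. The sketch below assumes the common logistic form $S(x) = 1/(1 + e^{-x/T})$ with temperature $T$ and a neuron that subtracts its threshold from the weighted sum of its inputs; the exact definitions used in this work are given in Chapter 4 and may differ, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid with a temperature parameter. The functional form is an
// assumption of this sketch.
double sigmoid(double x, double temperature = 1.0) {
    return 1.0 / (1.0 + std::exp(-x / temperature));
}

// Output of a single neuron: the sigmoid of the weighted input sum minus
// the threshold. The sign convention for the threshold is assumed.
double neuronOutput(const std::vector<double>& weights,
                    const std::vector<double>& inputs,
                    double threshold) {
    double sum = -threshold;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        sum += weights[i] * inputs[i];
    }
    return sigmoid(sum);
}
```

Under these assumptions, $S(10) \approx 0.99995$ and $S(-10) \approx 0.00005$, so an argument of magnitude 10 already drives the neuron to within about $5 \times 10^{-5}$ of saturation.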

Gene linkage is also important for neural networks, but it is much less clear
compared to the case of multi-dimensional linear cuts. It is reasonable to place the
genes for the threshold and the weights of an individual neuron next to each other
since they collectively affect its behavior. On the other hand, it is not clear whether
there is an actual advantage or disadvantage of placing the parameters of neurons
within the same layer next to each other or not. We use a single chromosome and
the genetic map shown in Fig. 5.18. The first two genes are the mutation and
crossover rates. The next set of genes are the threshold and the synaptic weights
of the first neuron. After that are the parameters of the 2nd neuron, etc. We have
ordered the neurons as shown in Fig. 5.19 where N is the total number of neurons
in the network.

^15 Strictly speaking, the layered feed-forward neural network is implemented in a separate class
that is used by the species NeuralNetwork. It is, however, beyond the scope of this work to
discuss proper C++ design issues.

^16 Since we are training with genetic algorithms, we could have used step activation functions
as well.

[Figure 5.18: genetic map; the first row holds $R_m$ and $R_c$, and each subsequent row holds
the threshold and the weights of one neuron]

Figure 5.18: Genetic map of NeuralNetwork individuals. The first two genes (on
the upper row) are implicit parameters of the genetic evolution and they are not
used in the performance measure $R_f$. The subsequent rows represent the thresholds
and the weights of individual neurons where N is the total number of neurons. The
neuronal parameters can be viewed as an operon and thus they are placed next to
each other.

[Figure 5.19: diagram of the neuron ordering, ending at the output neuron]

Figure 5.19: The ordering of neurons for the purpose of placing their parameters
on the single chromosome representing the genome of the NeuralNetwork
species.

This specific definition of NeuralNetwork species allows us to train neural networks
using the $R_f$ measure. Since this cannot be done with local algorithms, this
feature alone extends the domain of applicability of genetic algorithms as compared
to local algorithms. In other words, using genetic algorithms for training of
neural networks is not just a matter of preference. In some cases it is a necessity.

In addition, even in the cases when local algorithms are an option, genetic algorithms
are much less likely to get “trapped” in some inferior solution and stay
there. Even though the CPU requirements per iteration are higher for genetic algorithms
than for local algorithms, it should be emphasized that genetic algorithms
are investigating many solutions at the same time. In particular, especially when
the mutation rate cannot go to zero, there is no need to perform many “runs”
of training and select the best one^17. Genetic training of neural networks automatically
discards inferior solutions if they fail to improve in the long run. As
a consequence, this version of genetic algorithms and NeuralNetwork species can
(and should) be run for as long as practical. There is no need to stop the evolution
and start another run hoping that it will give better results. This obviates
the need for human intervention and, consequently, it is arguable that training of
neural networks with genetic algorithms is effectively faster. The latter statement
implicitly assumes that CPU time is less expensive than the time a human would
have to spend to supervise a collection of different “runs”.

^17 When using local algorithms, this is absolutely necessary since the “best” local minimum
is usually not found during the first run even if the local algorithm is preceded by a random
algorithm.

In addition to these advantages, one can extend the definition of the NeuralNetwork
species to include integer valued genes that specify the number of neurons
in each layer and even the number of layers^18. This issue is extremely important
for very complex problems that require a large number of neurons. If the number
of neurons is too large, the computational complexity increases significantly and
it becomes very difficult to train the network. If these numbers are controlled by
genes, however, they become dynamical variables of the population dynamics. If
speed of network evaluation is part of the sexual selection criteria^19, for example,
the population should discover in the long run the best performers with the minimum
number of neurons, i.e. implicitly discover the “correct” number of neurons
needed to handle a specific problem. Since the numbers of neurons in each layer
are integers, it is impossible to use local algorithms to perform this task.

^18 Specifying the number of layers as a gene is somewhat difficult since one has to address the
issue of genetic compatibility between two neural networks with different numbers of layers.

Another possible extension is to use boolean valued genes that specify the
input variables that should serve as input stimulus for the neural network. Such an
approach would effectively “discover” which input variables are relevant and which
are not. Of course, this should never be used as a substitute for careful selection and
representation of the input data. However, sometimes it is just not clear whether
an input variable can discriminate between signal and background or whether it
is redundant in the presence of other variables. Such an approach would open
an entirely new domain of applicability of genetic algorithms. It would make it
possible to systematically investigate the relevance of input variables and interpret
the results physically.

Yet another useful extension would be to use genetic algorithms to discover the
best discriminant functions for a given set of data. For example, sometimes neural
networks perform no better than linear cuts or Fisher discriminants. This generally
reflects the fact that the input data representation is very good^20. It would be very
useful, however, to have a systematic way to determine whether this is the case
or not. One easy way to do that with genetic algorithms is to define the genetic
map shown in Fig. 5.20. The 1st chromosome contains the genes for mutation
and crossover, the gene M and any other general purpose genes (for example,
genes coding for life expectancy!). The integer valued gene M codes for which
discriminant function to use during the lifetime of the individual. Sometimes it
would be neural networks, another time linear cuts, etc. As a consequence, the
population would be evolving with respect to all discriminant functions, since they
are all heritable regardless of whether they were expressed or not^21. At the same
time, some of the discriminant functions will perform better than others and this
will result in a differential survival rate. Eventually, the gene M will be the same
for all individuals and reflect the best discriminant function for this set of data^22.

^19 The version of genetic algorithms described in this work is implemented to support such
extension, although it has not been used yet.

^20 In Chapter 7, we will give examples of such situations which involve the usage of the modified
Fox-Wolfram moments.

^21 It should be noted that genes coding for the expression of other genes are a quite common
(although poorly understood) phenomenon in biological individuals.

    Chromosome    Genes
    1             R_m, R_c, M, other general genes
    2             neural network genes
    3             linear cuts genes
    4             some other discriminant function
    5             yet another discriminant function
    6             ...

Figure 5.20: Extended genetic map of individuals that use various discriminant
functions at the same time. The first two genes on chromosome 1 are the mutation
and crossover rates. The third gene on chromosome 1, M, controls which of
the discriminant functions will be expressed. It is necessarily an integer number.
Chromosomes 2, 3, etc. carry genes for different discriminant functions. During
the life of an individual, only one of them is expressed and used to evaluate performance
(possibly normalized for CPU time consumption). However, the entire
genome is heritable.

There are many other useful extensions of genetic algorithms, many of which
involve neural networks. One could, for example, use a small neural network to
make  more  complex  sexual  selection  decisions.  In  that  case,  this  neural  network 
would  correspond  to  brain  functions  that  are  genetically  “hard-coded”.  Another 
neural  network  could  look  at  the  data  and  learn  to  separate  signal  from  background. 
It  is  beyond  the  scope  of  this  work  to  even  review  all  possible  extensions  and  their 
implications.  It  is  our  opinion,  however,  that  genetic  algorithms  will  play  an 
increasingly  important  role  in  high  energy  physics.  It  may  take  time  for  people  to 
realize  their  advantages  and  learn  how  to  diminish  their  disadvantages.  Eventually, 
however,  genetic  algorithms  should  become  a primary  tool  for  high  energy  physics 
data  analysis.  The  most  useful  point  of  view  about  genetic  algorithms  is  that  they 
extend  current  methods,  not  replace  them.  Existing  local  algorithms  can  be  easily 
embedded  as  we  illustrated  in  this  chapter.  Some  performance  measures  that  are 
important  in  high  energy  physics  can  be  used  with  genetic  algorithms,  but  not 
with local algorithms. Integer valued selection criteria (such as jet multiplicity,
for example) are equally easy to handle with genetic algorithms, but impossible if
using local algorithms. In summary, genetic algorithms offer a systematic way to
investigate various issues with currently existing (generally unsystematic) methods
and at the same time they can be used to solve problems that current methods
cannot handle.

^22 The issue of using multiple discriminant functions is somewhat more complicated since, in
general, neural networks will give the best performance in the long run, but not necessarily in
the short run. As a consequence, linear cuts, for example, could force the population to contain
very few neural networks and, thus, not allow individuals expressing neural networks to improve
themselves. One way to solve this problem is to extend the sexual behavior implemented in class
Individual to allow mate selection on the basis of which discriminant function is expressed. This
would be analogous to preferring black hair vs. brown, for example. This approach would generalize
the notion of performance in the context of genetic algorithms since performance would also
depend on preferences and not be the same as the performance measure of a particular discriminant
function. In nature, this generalization is the rule and, consequently, this solution would
be consistent with the overall approach of this version of genetic algorithms which essentially
consists of “copying” what nature has discovered after 4 billion years of evolution.


CHAPTER  6 

ENHANCING  THE  HIGGS  BOSON  SIGNAL 


The electro-weak symmetry breaking in the Standard Model implies the existence
of the Higgs boson [23, 24]. Apart from the $\tau$ neutrino, this is the only particle
in the minimal version of the Standard Model that has not been discovered yet.

Experimental data accumulated so far support all other predictions of the
Standard Model. It is, therefore, reasonable to expect that the Higgs boson will
eventually be discovered. The Large Hadron Collider (LHC) that is under construction
at CERN will be sufficient to prove or disprove the existence of the Higgs boson
within five years.

In this chapter, we use a combination of “traditional” cuts and neural networks
to better discriminate between the Higgs signal at LHC and the ordinary QCD
background [25]. Since the mass of the Higgs boson, $m_H$, is not predicted by the
Standard Model, we restrict our analysis to $m_H = 400$ GeV^1.

One of the relevant signatures of the production of a 400 GeV Higgs particle
is its decay into two Z bosons, i.e. $H \to ZZ$. Each Z decays leptonically about
10% of the time and hadronically about 70% of the time. The case of both Z
bosons decaying leptonically ($ZZ \to \ell^+\ell^-\ell^+\ell^-$) is called the “discovery” mode for
Higgs. However, there are very few events that correspond to this mode due to the
low probability of both Z’s decaying into leptons.

^1 Since the time this work was done, there are indications that the mass of Higgs is probably
much smaller than 400 GeV. Nevertheless, the techniques used in this chapter can be applied to
other mass ranges as well even though the analysis would have to be re-done.

Another decay mode occurs when one of the Z bosons decays leptonically and the
other Z decays hadronically into a $q\bar{q}$ pair. The hadronic decay manifests itself
as a pair of jets. In other words, the signature of a Higgs signal event for this
decay mode consists of a large transverse momentum charged lepton pair plus two
accompanying jets (i.e., $\ell^+\ell^- jj$). Due to the higher probability of a Z decaying into
hadrons, there are more signal events for this decay mode than for the discovery
mode. The predominant background for this process is a single large transverse
momentum Z boson plus the associated jets that resemble the Higgs boson signal.
Requiring the Z boson to have a large transverse momentum by demanding a
large $P_T$ lepton pair forces the background to have a large $P_T$ “away-side” quark
or gluon via subprocesses like $qg \to Zq$ and $q\bar{q} \to Zg$. This away-side parton
often fragments via gluon bremsstrahlung and produces away-side jet-pairs which
resemble the signal.
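
As a rough check of the relative event rates of the two decay modes, one can use only the approximate branching fractions quoted in this chapter (about 10% leptonic and about 70% hadronic for each Z). The factor of 2 accounts for either of the two Z bosons being the one that decays leptonically; detector acceptance and cuts are ignored in this estimate.

$$P(ZZ \to \ell^+\ell^-\ell^+\ell^-) \approx (0.10)^2 = 0.01, \qquad P(ZZ \to \ell^+\ell^- jj) \approx 2 \times 0.10 \times 0.70 = 0.14$$

Under these assumptions, the mode with one leptonic and one hadronic Z decay is roughly 14 times more frequent than the discovery mode.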

We will use neural networks to help distinguish the $ZZ \to \ell^+\ell^- jj$ decay of a
400 GeV Higgs boson signal from the Z+jets background in proton-proton collisions
at 15 TeV. The neural network will be used in conjunction with the standard
data cuts to provide additional signal to background enhancement. Here we investigate
whether neural networks can help with the “jet-physics” of the $\ell^+\ell^- jj$
mode. Any progress made here can be applied to the $WW \to \ell\nu jj$ decay mode of
the Higgs boson as well [26, 27].

We  will  not  try  to  give  a detailed  simulation  of  an  experiment  at  the  LHC. 
Nevertheless,  the  detector  parameters  we  use  are  close  to  the  actual  parameters 
at  LHC.  Higgs  boson  production  at  a 15  TeV  proton-proton  collider  is  used  as  an 
illustration of neural networks as a tool in high energy jet phenomenology. We have
designed, constructed, and tested the networks presented here from the beginning
with the emphasis on high energy data analysis.

Unlike  many  other  colliders,  the  frequency  of  interactions  at  LHC  will  be  so 
high  that  there  can  be  multiple  interactions  per  beam  crossing.  This  is  called 
pile-up  and  results  in  multiple  events  entering  the  detector  at  the  same  time.  For 
that  reason,  we  consider  both  the  case  of  single  and  multiple  interactions  per  beam 
crossing. 


6.1 Data Generation and Cuts Without Pile-Up

6.1.1 Event Generation

We consider first the ideal case where only one event at a time enters the
detector. We want to determine whether neural networks can be trained to distinguish
between the Higgs boson signal and the Z+jets background when there
is no pile-up. ISAJET version 7.06 is used to generate Higgs bosons with a mass
of 400 GeV in 15 TeV proton-proton collisions. The generated width of the Higgs
is about 30 GeV. The Higgs boson is forced to decay into two Z bosons with
one Z decaying leptonically and the other Z decaying into a quark-antiquark
pair. We refer to this as the “signal”. The “background” consists of single Z
boson events generated with the hard-scattering transverse momentum of the Z,
$\hat{k}_T$, greater than 100 GeV. Single Z bosons are produced at large transverse
momentum via the “ordinary” QCD subprocesses $qg \to Zq$, $\bar{q}g \to Z\bar{q}$, and $q\bar{q} \to Zg$.
These subprocesses, of course, generate additional gluons via bremsstrahlung off
both incident and outgoing color non-singlet partons, resulting in multiparton final
states which subsequently fragment into hadrons. This is referred to as the
Z+jets background.

We are not attempting to do a detailed simulation of an LHC detector [28, 29].
Events are analyzed by dividing the solid angle into “calorimeter” cells having size
$\Delta\eta\,\Delta\phi = 0.2 \times 15^\circ$, where $\eta$ and $\phi$ are the
pseudorapidity and azimuthal angle, respectively. A single cell has an energy (the
sum of the energies of all the particles that hit the cell excluding neutrinos) and a
direction given by the coordinates of the center of the cell. The transverse energy
of each cell is computed from the cell energy and direction. Large transverse
momentum leptons are analyzed separately and are not included when computing the
energy of a cell. Jets are defined using a simple algorithm. One first considers the
“hot” cells (those with transverse energy greater than 5 GeV). Cells are combined to
form a jet if they lie within a specified “distance” or “radius”,
$R^2 = \Delta\eta^2 + \Delta\phi^2$, in $\eta$-$\phi$ space from each other. Jets have
an energy $E_j$ given by the sum of the energy of each cell in the cluster and a
momentum $\vec{p}_j$ given by the vector sum of the momenta of each cell. The
invariant mass of a jet is simply $M_j^2 = E_j^2 - \vec{p}_j \cdot \vec{p}_j$.
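The cell and jet kinematics described above can be illustrated with a minimal sketch. It is not the analysis code of this work: each cell is treated as massless (an assumption), the function names are hypothetical, and the clustering of hot cells into jets is left out. The cell count follows from the quoted granularity (0.2 in pseudorapidity over $|\eta| < 4$, 15° in azimuth).

```python
import math

# Calorimeter granularity quoted in the text: 0.2 in eta over |eta| < 4,
# 15 degrees in phi. 40 eta bins x 24 phi bins.
N_CELLS = round(8.0 / 0.2) * (360 // 15)


def transverse_energy(energy, eta):
    """E_T of a cell from its energy and the pseudorapidity of its center."""
    return energy / math.cosh(eta)


def cell_four_vector(energy, eta, phi):
    """(E, px, py, pz) of a cell, treated as massless (an assumption)."""
    et = transverse_energy(energy, eta)
    return (energy, et * math.cos(phi), et * math.sin(phi), et * math.sinh(eta))


def jet_mass(cells):
    """Jet invariant mass from M^2 = E^2 - p.p for cells given as (energy, eta, phi)."""
    e = px = py = pz = 0.0
    for energy, eta, phi in cells:
        ce, cx, cy, cz = cell_four_vector(energy, eta, phi)
        e, px, py, pz = e + ce, px + cx, py + cy, pz + cz
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(e * e - (px * px + py * py + pz * pz), 0.0))
```

With this granularity `N_CELLS` reproduces the 960 cells quoted for the calorimeter.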

We  have  taken  the  energy  resolution  to  be  perfect,  which  means  that  the  only 
resolution  effects  are  caused  by  the  lack  of  spatial  resolution  due  to  the  cell  size. 
However, we are using a very crude calorimeter with large cells (960 cells with
$|\eta| < 4$). Experiments at, for example, the LHC [3, 4] will have considerably
smaller  cell  size  and  hence  better  spatial  resolution.  Even  with  the  addition  of 
energy  resolution  effects,  the  combined  spatial  and  energy  resolution  at  the  LHC 
should  be  comparable  to  or  better  than  in  our  analysis. 




6.1.2  Lepton  Trigger 

Our “zero-level” trigger is designed to select large transverse momentum Z
bosons that have decayed into charged leptons. The first cut is made by demanding
that the event contain at least two high transverse momentum leptons
($e^\pm$ or $\mu^\pm$) in the central region as follows:

$P_T(\ell^\pm) > 25$ GeV, $|\eta(\ell^\pm)| < 2.5$. (6.1)

Lepton pairs ($e^+e^-$ and $\mu^+\mu^-$) are constructed for the events that survive
this first cut. The pairs are ordered according to their invariant mass, with pair #1
having the mass closest to the Z boson and pair #2 being the second closest, etc.
Finally, the event is rejected unless at least one lepton pair satisfies the following:

$P_T(\ell^+\ell^-) > 100$ GeV. (6.2)

Table 6.1 shows that for a 400 GeV Higgs at 15 TeV, roughly 10,000 events per
year pass this “zero level” trigger. Here the integrated luminosity for one year
is taken to be the expected LHC value of $10^5$/pb. About 2 million background
events per year survive this “zero level” lepton cut.
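The logic of the lepton trigger can be sketched as follows. This is an illustration under stated assumptions, not the code used in this work: leptons are treated as massless, the Z mass value and all function names are hypothetical choices, and the accepted pair is taken to be the first pair in the mass ordering that passes the pair transverse momentum cut.

```python
import math
from itertools import combinations

M_Z = 91.19  # GeV; nominal Z mass, an assumed value


def massless(px, py, pz):
    """Four-vector (E, px, py, pz) of a lepton, neglecting its mass."""
    return (math.sqrt(px * px + py * py + pz * pz), px, py, pz)


def pt(p):
    return math.hypot(p[1], p[2])


def eta(p):
    return math.asinh(p[3] / pt(p))


def mass(p):
    return math.sqrt(max(p[0] ** 2 - p[1] ** 2 - p[2] ** 2 - p[3] ** 2, 0.0))


def lepton_trigger(leptons):
    """leptons: list of (flavor, charge, four_vector).

    Returns the four-vector of the accepted lepton pair, or None if rejected.
    """
    # Single-lepton cut: P_T > 25 GeV and |eta| < 2.5.
    good = [l for l in leptons if pt(l[2]) > 25.0 and abs(eta(l[2])) < 2.5]
    if len(good) < 2:
        return None
    # Same-flavor, opposite-charge pairs, ordered by closeness to the Z mass.
    pairs = [tuple(x + y for x, y in zip(a[2], b[2]))
             for a, b in combinations(good, 2)
             if a[0] == b[0] and a[1] == -b[1]]
    pairs.sort(key=lambda p: abs(mass(p) - M_Z))
    # Pair cut: keep the first pair in that ordering with P_T > 100 GeV.
    for pair in pairs:
        if pt(pair) > 100.0:
            return pair
    return None
```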

This high transverse momentum lepton pair cut is, of course, crucial. The transverse
momentum spectrum of the single Z QCD background falls off rapidly, while for the
heavy Higgs the signal is peaked at about half the mass of the Higgs. Here one
wants to take as large a cut on $P_T(\ell^+\ell^-)$ as possible without losing too
much of the signal. However, even with this cut, the background is still nearly 200
times the signal (see Table 6.1)!

6.1.3  Jet-Pair  Selection 

The jet topology of events with at least one large transverse momentum lepton
pair is analyzed by first examining only jet cores (i.e., narrow jets of size
$R_j(\mathrm{core})$). Here one includes only those jet cores satisfying

$E_T(\text{jet core}) > 25$ GeV, $|\eta(\text{jet core})| < 3$, (6.3)

with

$R_j(\mathrm{core}) = 0.2$. (6.4)

In an attempt to find the two jets produced by the hadronic decay of the
large transverse momentum Z boson, jet pairs are formed by demanding that the
distance between the two jet cores in $\eta$-$\phi$ space,
$d_{jj}^2 = (\eta_1 - \eta_2)^2 + (\phi_1 - \phi_2)^2$, be less than 1.6. Namely,

$d_{jj}(\text{jet-jet cores}) < 1.6$. (6.5)

In addition, the jet-jet cores are required to satisfy

$P_T(jj) > 100$ GeV, $|\phi_{\ell\ell} - \phi_{jj}| > 90^\circ$, (6.6)

where $P_T(jj)$ is the total transverse momentum of the core jet-pair and
$|\phi_{\ell\ell} - \phi_{jj}|$ is the azimuthal angle between the leading lepton pair
and the core jet-pair. The jet-pair is required to be in the opposite hemisphere
(or “away-side”) from the lepton pair.


[Figure 6.1 plot: “Jet-Jet Invariant Mass”, 400 GeV Higgs in 15 TeV pp collisions. Vertical axis: events per year in 10 GeV bin; horizontal axis: jet-jet mass (GeV). Curves: Higgs→ZZ and Z+jets, each with and without pile-up.]

Figure 6.1: Shows the away-side jet-jet mass for a 400 GeV Higgs boson produced
in 15 TeV p-p collisions. The plot corresponds to the number of events per year
(Lum = $10^5$/pb) in a 10 GeV bin for the $H \to ZZ$ signal and the Z+jets
background. The ideal case where only one event at a time enters the detector (no
pile-up) and the case of multiple interactions per beam crossing (with pile-up) are
shown. In all cases the events have survived the “zero-level” lepton trigger and
the jet-pair selection criterion.

If  more  than  one  jet-pair  meets  all  of  these  requirements  then  the  pair  with  the 
largest  total  transverse  energy  is  selected. 
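The jet-pair selection can be summarized in a short sketch. It is illustrative only and rests on several assumptions: jet cores are given as (E_T, η, φ) with azimuthal angles in radians, each core is treated as massless so that its transverse momentum equals its transverse energy, and the function names are hypothetical.

```python
import math
from itertools import combinations


def delta_phi(a, b):
    """Azimuthal separation folded into [0, pi]."""
    d = abs(a - b) % (2.0 * math.pi)
    return 2.0 * math.pi - d if d > math.pi else d


def select_jet_pair(cores, phi_ll):
    """cores: list of (et, eta, phi) jet cores; phi_ll: lepton-pair azimuth (radians).

    Returns the selected pair of cores, or None if no pair qualifies.
    """
    # Core cut: E_T > 25 GeV and |eta| < 3.
    good = [c for c in cores if c[0] > 25.0 and abs(c[1]) < 3.0]
    best, best_et = None, 0.0
    for a, b in combinations(good, 2):
        # Distance between the two cores in eta-phi space must be below 1.6.
        d_jj = math.hypot(a[1] - b[1], delta_phi(a[2], b[2]))
        if d_jj >= 1.6:
            continue
        # Total transverse momentum of the pair must exceed 100 GeV.
        px = a[0] * math.cos(a[2]) + b[0] * math.cos(b[2])
        py = a[0] * math.sin(a[2]) + b[0] * math.sin(b[2])
        if math.hypot(px, py) <= 100.0:
            continue
        # The pair must lie in the hemisphere opposite to the lepton pair.
        if delta_phi(math.atan2(py, px), phi_ll) <= math.pi / 2.0:
            continue
        # Among qualifying pairs, keep the one with the largest total E_T.
        if a[0] + b[0] > best_et:
            best, best_et = (a, b), a[0] + b[0]
    return best
```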

Table 6.1 shows that of the 10,000 signal events passing the “zero level” lepton
trigger about 49% also pass the jet-pair selection criterion. Unfortunately, about
30% of the ordinary Z+jets background events that survive the “zero level” lepton
trigger also have a jet-pair meeting the selection criteria.

We use the enhancement, $F_{enh}$, and the efficiency, $F_{eff}$, defined in Chapter 3
as a measure of the effectiveness of a particular cut. The jet-pair selection criterion
results in an enhancement of 1.6 with an efficiency of about 49%. The “zero level”
lepton trigger is used as a reference point and is normalized to an efficiency of 100%
and an enhancement of one. One might have expected to do better at this stage.
However, once we require that the Z boson have a large transverse momentum, we
force the background to have a large away-side quark or gluon jet. This away-side
parton often fragments via gluon bremsstrahlung into multiple away-side jets
which then survive the selection criteria.
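As a worked check of these two numbers, the sketch below recomputes the efficiency and enhancement of the jet-pair selection from the event counts listed in Table 6.1. It assumes the definition given in the table caption (enhancement is the percentage of signal divided by the percentage of background surviving the cuts); the function names are hypothetical.

```python
def efficiency(signal_after, signal_ref):
    """Fraction of the reference signal that survives the cut."""
    return signal_after / signal_ref


def enhancement(signal_after, signal_ref, back_after, back_ref):
    """Surviving signal fraction divided by surviving background fraction."""
    return (signal_after / signal_ref) / (back_after / back_ref)


# Events per year from Table 6.1 (no pile-up).
SIG_REF, BACK_REF = 10185, 1961818      # after the lepton trigger (reference point)
SIG_JJ, BACK_JJ = 4995, 595622          # after the jet-pair selection

eff_jj = efficiency(SIG_JJ, SIG_REF)                        # about 0.49
enh_jj = enhancement(SIG_JJ, SIG_REF, BACK_JJ, BACK_REF)    # about 1.6
```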


Table 6.1: 400 GeV Higgs boson produced in 15 TeV p-p collisions. The table
shows the number of events per year (with Lum = $10^5$/pb) for the $H \to ZZ$ signal
and Z+jets background for the ideal case where only one event at a time enters
the detector (i.e., no pile-up). The “zero-level” lepton trigger is used as a reference
point and is normalized to 100%. The enhancement factor is defined to be the
percentage of signal divided by the percentage of background surviving the given
set of cuts.

Selection cut                 Signal H -> ZZ            Background Z+jets         Back/Sig   Enhancement factor
                              % overall   Events/year   % overall   Events/year              Relative   Overall

Lepton trigger                100%        10185         100%        1961818       193        1.0        1.0
  P_T(l) > 25 GeV
  P_T(ll) > 100 GeV

Jet pair selection            49.0%       4995          30.4%       595622        119        1.6        1.6
  E_T(j) > 25 GeV
  P_T(jj) > 100 GeV

Z-mass cut                    25.0%       2551          2.3%        44244         17         6.9        11.1
  81 < M_Z < 101 GeV

H-mass cut                    22.0%       2241          0.7%        14471         6.5        2.7        29.8
  350 < M_H < 450 GeV

Z-mass & net cut              10.4%       1060          0.2%        3683          3.5        5.0        55.4
  81 < M_Z < 101 GeV
  net cut > 0.75

H-mass & net cut              9.4%        954           0.1%        1862          2.0        3.3        98.7
  350 < M_H < 450 GeV
  net cut > 0.75


6.1.4  Invariant  Mass  Cuts 

The invariant mass, $M_{jj}(\mathrm{full})$, is constructed by using all cells that lie
within a “distance” $R_{jj}(\mathrm{full})$ in $\eta$-$\phi$ space of either of the two
jets. Cells are not double counted. For example, a cell may lie within
$R_{jj}(\mathrm{full})$ of both jets; nevertheless it is counted just once. The aim here
is, of course, to reconstruct the invariant mass of the Z boson as shown in Figure 6.1.
However, this full jet-jet invariant mass will only be used in the event selection. The
Higgs mass will be reconstructed by setting $M_{jj} = M_Z$. At this stage, events are
rejected unless the full jet-jet mass satisfies:

$81 < M_{jj}(\mathrm{full}) < 101$ GeV, (6.7)

with

$R_{jj}(\mathrm{full}) = 0.6$. (6.8)
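The construction of the full jet-jet mass without double counting can be sketched as follows. This is a hedged illustration, not the code of this work: cells are given as (energy, η, φ) and treated as massless, jet axes as (η, φ) with azimuth in radians, and the function names are hypothetical.

```python
import math


def delta_phi(a, b):
    """Azimuthal separation folded into [0, pi]."""
    d = abs(a - b) % (2.0 * math.pi)
    return 2.0 * math.pi - d if d > math.pi else d


def cells_near_jets(cells, jet1, jet2, radius=0.6):
    """Cells (energy, eta, phi) within `radius` of either jet axis (eta, phi).

    The logical `or` keeps a cell that is close to both jets only once.
    """
    def near(cell, jet):
        return math.hypot(cell[1] - jet[0], delta_phi(cell[2], jet[1])) < radius
    return [c for c in cells if near(c, jet1) or near(c, jet2)]


def full_jet_jet_mass(cells, jet1, jet2, radius=0.6):
    """Invariant mass of all cells within `radius` of either jet."""
    e = px = py = pz = 0.0
    for energy, eta, phi in cells_near_jets(cells, jet1, jet2, radius):
        et = energy / math.cosh(eta)          # massless-cell assumption
        e += energy
        px += et * math.cos(phi)
        py += et * math.sin(phi)
        pz += et * math.sinh(eta)
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def passes_z_window(m_jj):
    """Mass window of 81 to 101 GeV around the Z boson mass."""
    return 81.0 < m_jj < 101.0
```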

As can be seen in Figure 6.1 and Table 6.1, about 51% of the Higgs signal
passing both the lepton cut and the jet-pair selection have $M_{jj}$ within 10 GeV of
the Z boson mass. On the other hand, only about 7% of the Z+jets background
events surviving both the lepton cut and the jet-pair selection have a full jet-pair
invariant mass within 10 GeV of the Z boson mass. This corresponds to an overall
enhancement factor at this stage of about 11 with an overall efficiency of about
25%. The background lies well above the signal in Figure 6.1 so that one cannot
directly see the Z mass peak. Nevertheless, the jet-jet invariant mass cut is very
important.




Figure 6.2: Shows the reconstructed mass of a 400 GeV Higgs boson produced in 15
TeV p-p collisions. The plot corresponds to the number of events per year
(Lum = $10^5$/pb) in a 25 GeV bin for the $H \to ZZ$ signal and the Z+jets background
for the ideal case where only one event at a time enters the detector (no pile-up).
The events have survived the “zero-level” lepton trigger and the jet-pair selection
criterion with $81 < M_{jj}(\mathrm{full}) < 101$ GeV. No network cut has been made.


The Higgs invariant mass is constructed from the momentum vectors of the
two charged leptons and the momentum vector of the jet-pair as follows:

$M_H^2 = (E_{\ell^+} + E_{\ell^-} + E_{jj})^2 - (\vec{p}_{\ell^+} + \vec{p}_{\ell^-} + \vec{p}_{jj})^2$, (6.9)

where

$E_{jj}^2 = \vec{p}_{jj} \cdot \vec{p}_{jj} + M_Z^2$. (6.10)

The mass of a jet is not a well defined quantity since it depends on the soft
particles. The momentum vector of a jet is better defined and is determined
primarily by the core cells. Thus, in constructing the Higgs mass we use the
momentum vector of the jet-pair but not the jet-pair mass. The mass of the
jet-pair is set equal to the mass of the Z boson.
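The reconstruction of the Higgs mass with the jet-pair mass fixed to the Z boson mass can be sketched in a few lines. The Z mass value and the function name are assumptions made for illustration; the jet-pair enters only through its momentum vector, as described above.

```python
import math

M_Z = 91.19  # GeV; nominal Z mass, an assumed value


def higgs_mass(lep_plus, lep_minus, p_jj):
    """Reconstructed Higgs mass.

    lep_plus, lep_minus: lepton four-vectors (E, px, py, pz).
    p_jj: momentum vector (px, py, pz) of the jet pair.
    """
    # Jet-pair energy from its momentum with the mass set to M_Z.
    e_jj = math.sqrt(sum(c * c for c in p_jj) + M_Z ** 2)
    e = lep_plus[0] + lep_minus[0] + e_jj
    p = [lep_plus[i + 1] + lep_minus[i + 1] + p_jj[i] for i in range(3)]
    # Invariant mass of the lepton pair plus jet pair system.
    return math.sqrt(max(e * e - sum(c * c for c in p), 0.0))
```

As a consistency check, a 400 GeV Higgs at rest decaying into two back-to-back Z bosons of energy 200 GeV each is recovered with mass 400 GeV, independent of the measured jet-pair mass.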

Figure 6.2 shows the reconstructed Higgs mass for both the signal and background
events that have passed the lepton cuts, the jet-pair selection, and have
$81 < M_{jj}(\mathrm{full}) < 101$ GeV. At this stage, there are about 2,000 Higgs boson
events and 14,000 QCD background events per year within 50 GeV of the true Higgs mass
of 400 GeV. This corresponds to an overall enhancement factor of about 30 (see
Table 6.1) with an overall efficiency of about 22%. However, even with this
enhancement the Z+jets background is still more than 6 times the signal. It is at
this stage that neural networks will be used to provide an additional enhancement
of signal over background.

6.2  Network  Analysis  Without  Pile-up 

We  will  train  a neural  network  to  distinguish  between  the  signal  and  background 
events  that  have  already  passed  the  lepton  cuts,  the  jet-pair  selection,  and  have 
81  < Mjj(full)  < 101  GeV.  These  important  cuts  are  made  before  sending  the 
events  to  the  network.  Even  though  both  the  signal  and  background  events  have 
survived  these  cuts,  there  is  still  additional  information  in  the  events  that  is  not 
the  same  for  the  signal  and  the  background.  The  network  can  use  these  differences 
to  further  help  distinguish  signal  from  background. 


[Figure 6.3 plot: “Jet Multiplicity”, 400 GeV Higgs in 15 TeV pp collisions, $81 < M_{jj} < 101$ GeV, no pile-up. Horizontal axis: number of jets with $E_T > 5$ GeV. Histograms: Higgs→ZZ signal and Z+jets background.]

Figure 6.3: Shows the multiplicity of jets for 400 GeV Higgs bosons produced in 15
TeV p-p collisions. The plot corresponds to the percentage of events with N jets
with $E_T$ greater than 5 GeV for the $H \to ZZ$ signal and the Z+jets background
for the ideal case where only one event at a time enters the detector (no pile-up).
The events have survived the “zero-level” lepton trigger and the jet-pair selection
criterion with $81 < M_{jj}(\mathrm{full}) < 101$ GeV.

6.2.1  Network  Inputs 

One of the important factors in successful application of neural networks is
the proper selection of input variables. These variables must characterize the
differences between the signal and the background. In this analysis we choose the
following nine input variables:

$x_1 = d_{jj}$,
$x_2 = |E_T(1) - E_T(2)| / (E_T(1) + E_T(2))$,
$x_3 = N_{jets}(E_T > 5\ \mathrm{GeV})$,
$x_4 = E_T(R_{jj} < 0.2) / E_T(R_{jj} < 1.0)$,
$x_5 = E_T(0.2 < R_{jj} < 0.6) / E_T(R_{jj} < 1.0)$,
$x_6 = E_T(0.6 < R_{jj} < 1.0) / E_T(R_{jj} < 1.0)$,
$x_7 = M(R_{jj} < 0.2) / M(R_{jj} < 1.0)$,
$x_8 = M(0.2 < R_{jj} < 0.6) / M(R_{jj} < 1.0)$,
$x_9 = M(0.6 < R_{jj} < 1.0) / M(R_{jj} < 1.0)$. (6.11)

The first variable is simply the distance in $\eta$-$\phi$ space between the two
“away-side” jets selected in the jet-pair selection. For the signal this is related to
the opening angle of the quark-antiquark pair resulting from the $Z \to q\bar{q}$ decay,
while for the background this is the distance between, for example, an outgoing quark
and the radiated gluon jet. The second variable is the “skewness” of the transverse
energies of the two jet cores, while the third variable is simply the overall number
of jets (with $E_T > 5$ GeV) in the event and is shown in Figure 6.3.

The remaining variables depict the precise manner in which transverse energy
and mass are distributed around the away-side jet-pair. For example, $x_6$ is the
ratio of the amount of transverse energy coming from calorimeter cells within
the “halo” region $0.6 < R_{jj} < 1.0$ surrounding both jets to the total transverse
energy of the extended jet-pair ($R_{jj}(\mathrm{extended}) = 1.0$). As can be seen in
Figure 6.4, the fraction of transverse energy in this region is, on the average, slightly
larger for the background than for the signal. Similarly, $x_9$ is the fraction of the
extended jet-jet invariant mass that comes from calorimeter cells in the “halo” region
$0.6 < R_{jj} < 1.0$. Figure 6.5 shows that more of the extended jet-jet mass lies in
this region for the background than for the signal. The other halo regions also
show slight variations between signal and background which the network can use
to help distinguish between the two.

[Figure 6.4 plot: “Transverse Energy Fraction”, 400 GeV Higgs in 15 TeV pp collisions, $81 < M_{jj} < 101$ GeV, no pile-up. Horizontal axis: $E_T(0.6 < R < 1.0) / E_T(R < 1.0)$. Histograms: Higgs→ZZ signal and Z+jets background.]

Figure 6.4: Shows the fraction of transverse energy coming from calorimeter cells
within the “halo” region $0.6 < R_{jj} < 1.0$ surrounding either of the away-side jets.
The plot corresponds to the percentage of events with the jet-jet transverse energy
fraction within the 0.025 bin for the $H \to ZZ$ signal and the Z+jets background
for the ideal case where only one event at a time enters the detector (no pile-up).
The events have survived the “zero-level” lepton trigger and the jet-pair selection
criterion and have $81 < M_{jj}(\mathrm{full}) < 101$ GeV.
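The transverse energy fractions used as network inputs can be sketched as below. Two points are assumptions rather than statements from the text: the distance $R_{jj}$ of a cell is taken to be its distance to the nearer of the two jet axes, and the core fraction (called `x4` here) is defined by analogy with the other ratios. Cells are given as (E_T, η, φ), and the function names are hypothetical.

```python
import math


def delta_phi(a, b):
    """Azimuthal separation folded into [0, pi]."""
    d = abs(a - b) % (2.0 * math.pi)
    return 2.0 * math.pi - d if d > math.pi else d


def distance_to_pair(eta, phi, jet1, jet2):
    """Distance of a cell to the nearer jet axis (an assumed definition of R_jj)."""
    return min(math.hypot(eta - j[0], delta_phi(phi, j[1])) for j in (jet1, jet2))


def et_fractions(cells, jet1, jet2):
    """Return (x4, x5, x6): E_T fractions in R < 0.2, 0.2 <= R < 0.6, 0.6 <= R < 1.0.

    cells: list of (et, eta, phi); jet1, jet2: jet axes (eta, phi).
    All fractions are relative to the E_T of the extended jet pair (R < 1.0).
    """
    edges = (0.0, 0.2, 0.6, 1.0)
    sums = [0.0, 0.0, 0.0]
    for et, eta, phi in cells:
        r = distance_to_pair(eta, phi, jet1, jet2)
        for k in range(3):
            if edges[k] <= r < edges[k + 1]:
                sums[k] += et
    total = sum(sums)
    if total == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(s / total for s in sums)
```

The mass fractions would follow the same pattern, with the cell four-vectors of each region summed and the invariant mass of the region divided by that of the extended jet pair.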

The idea here is similar to the jet-jet profile analysis we presented in [26]. For
the signal, the away-side jet-pair arises from the $q\bar{q}$ decay of a large transverse
momentum Z boson. The Z boson is a color singlet and does not radiate gluons
during flight. On the other hand, the large $P_T$ away-side recoil quarks or gluons in
the single Z background are not color singlets and produce additional gluons via
bremsstrahlung. These radiated gluons deposit transverse energy around the jet-jet
cores. This results in more transverse energy and invariant mass surrounding
the jet-jet cores for the Z+jets background than for the Higgs boson signal. The
distribution of transverse energy and invariant mass around the “away-side” jet-pair
is slightly different in the two cases.

Figure 6.5: Shows the fraction of invariant mass coming from calorimeter cells
within the “halo” region $0.6 < R_{jj} < 1.0$ surrounding either of the away-side jets.
The plot corresponds to the percentage of events with the jet-jet invariant mass
fraction within the 0.05 bin for the $H \to ZZ$ signal and the Z+jets background
for the ideal case where only one event at a time enters the detector (no pile-up).
The events have survived the “zero-level” lepton trigger and the jet-pair selection
criterion and have $81 < M_{jj}(\mathrm{full}) < 101$ GeV.


6.2.2  Network  Structure  and  Training 

Since the input data sample is time independent, we chose a feedforward neural
network. Furthermore, in order to apply the error backpropagation algorithm in
its simplest form, we chose a layered network with 2 hidden layers. Each neuron
uses the sigmoidal activation function and the performance measure is the mean
square quadratic error, $E$, defined in Chapter 4. We can write this performance
measure as

$E = \frac{1}{2N_{sig}} \sum_{n=1}^{N_{sig}} (z_n - 1)^2 + \frac{1}{2N_{bak}} \sum_{n=1}^{N_{bak}} (z_n - 0)^2$, (6.12)

where $N_{sig}$ and $N_{bak}$ are the number of signal and background events in the data
sample and $z_n$ is the output of the network for the nth input. A perfect separation
of signal and background would imply that the network always responds with 1
for signal events and 0 for background events. As a consequence, $E$ would be zero
in such an ideal case. A completely random response would result in $E = 0.25$.
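The performance measure can be illustrated with a minimal sketch. It assumes that signal and background each contribute with weight one half, which is the normalization that reproduces $E = 0$ for a perfect response and $E = 0.25$ for an uninformative output of 0.5; the function name is hypothetical.

```python
def performance_measure(z_signal, z_background):
    """Mean square error with target 1 for signal and target 0 for background.

    z_signal, z_background: network outputs for the signal and background
    events. Each class is weighted by one half (an assumed normalization).
    """
    e_sig = sum((z - 1.0) ** 2 for z in z_signal) / (2.0 * len(z_signal))
    e_bak = sum((z - 0.0) ** 2 for z in z_background) / (2.0 * len(z_background))
    return e_sig + e_bak
```

With this normalization the result does not depend on the relative sizes of the signal and background samples.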

The network is trained on a sample of 8,348 signal and 7,254 background
events using the nine inputs shown earlier and where both signal and background
events have already satisfied the lepton cuts, the jet-pair selection, and have
$81 < M_{jj}(\mathrm{full}) < 101$ GeV. To get this training sample, it was necessary to
generate 80,000 Higgs boson events and 800,000 Z+jets events. We experimented with a
variety of network sizes and present here the results from a 9-16-8-1¹ net which has
305 thresholds and synaptic weights. After a lengthy training process we achieved
$E = 0.1678$ on the training sample, which indicates an improvement since this value
is less than 0.25.
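The count of 305 thresholds and synaptic weights follows directly from the 9-16-8-1 layout, as the sketch below shows. The forward pass is a generic fully connected sigmoid network with randomly initialized parameters, written only to illustrate the structure; the training procedure is not reproduced and all names are hypothetical.

```python
import math
import random


def count_parameters(layers):
    """Synaptic weights plus thresholds for fully connected consecutive layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))


def init_network(layers, seed=0):
    """Small random parameters; each neuron is a list [threshold, w_1, ..., w_n]."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(layers, layers[1:])]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(network, inputs):
    """Propagate one input vector through all layers and return the single output."""
    values = list(inputs)
    for layer in network:
        values = [sigmoid(neuron[0] + sum(w * v for w, v in zip(neuron[1:], values)))
                  for neuron in layer]
    return values[0]


LAYERS = (9, 16, 8, 1)
N_PARAMETERS = count_parameters(LAYERS)   # (9*16 + 16) + (16*8 + 8) + (8*1 + 1)
```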

6.2.3  Network  Performance 

Figure 6.6 shows the network response (i.e., $z_{net}$) for the sample of signal and
background events used in the training. The situation is far from the ideal. There
are some events around $z_{net} = 0.5$ for which the net cannot distinguish between
signal and background. Nevertheless, the net does allow for some separation of
signal and background. The net clearly recognizes some events as signal or
background, while for other events there is an overlap and the net cannot distinguish
between the two. Ideally one would like a clean separation between the signal
and background in Figure 6.6. One would then perform a network cut-off and
assign any event with $z_{net} > z_{cut}$ to be signal and events with $z_{net} < z_{cut}$ to be
background.

¹ A network with 9 inputs, 16 neurons in the first hidden layer, 8 neurons in the second hidden
layer, and 1 neuron in the output layer.

Figure 6.6: Shows the network response, $z_{net}$, for the sample of signal and
background events used in the training and for an independent sample of signal and
background events. The plot corresponds to the percentage of events with $z_{net}$
within a 0.05 bin for the $H \to ZZ$ signal and the Z+jets background for the ideal
case where only one event at a time enters the detector (no pile-up). The events
have survived the “zero-level” lepton trigger and the jet-pair selection criterion
and have $81 < M_{jj}(\mathrm{full}) < 101$ GeV.

Figure 6.6 also shows the network response (i.e., $z_{net}$) for an independent sample
of signal and background events not used in the training. If the network generalized
perfectly there would be no difference between the response of the network for the
independent and the training samples. The small differences seen in Figure 6.6
reflect the fact that we have trained the net on a relatively small sample of events.
We could improve the ability of the network to generalize by starting with a larger
training sample, but this result is sufficient for what we want to illustrate in this
work.

The enhancement and efficiency of the network cut-off depend on the value
chosen for $z_{cut}$, where the network enhancement and efficiency are defined as
follows:

$F_{enh} = \frac{\%\ \text{of signal with}\ z_{net} > z_{cut}}{\%\ \text{of background with}\ z_{net} > z_{cut}}$, (6.13)

$F_{eff} = \%\ \text{of signal with}\ z_{net} > z_{cut}$. (6.14)

The overall network performance can be characterized by the single curve of
the network enhancement versus the network efficiency shown in Figure 6.7. Each
point in Figure 6.7 corresponds to a different choice for the network cut-off with the
lower efficiencies and higher enhancements corresponding to larger values of $z_{cut}$.
In the analysis presented here, we choose $z_{cut} = 0.75$ which for the training sample
corresponds to a relative efficiency of about 42% with a relative enhancement of
about 6.
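The enhancement versus efficiency curve can be produced by scanning the cut value, as in the following sketch. The function names and the toy network outputs in the usage note are hypothetical; only the two definitions above are taken from the text.

```python
def network_cut(z_signal, z_background, z_cut):
    """Return (F_enh, F_eff) for a given cut on the network output."""
    f_eff = sum(1 for z in z_signal if z > z_cut) / len(z_signal)
    f_back = sum(1 for z in z_background if z > z_cut) / len(z_background)
    # If no background survives, the enhancement is unbounded.
    f_enh = f_eff / f_back if f_back > 0.0 else float("inf")
    return f_enh, f_eff


def performance_curve(z_signal, z_background, cuts):
    """One (efficiency, enhancement) point per cut value."""
    points = []
    for z_cut in cuts:
        f_enh, f_eff = network_cut(z_signal, z_background, z_cut)
        points.append((f_eff, f_enh))
    return points
```

For example, with toy outputs `[0.9, 0.8, 0.6, 0.3]` for signal and `[0.8, 0.4, 0.2, 0.1]` for background, a cut at 0.75 keeps half of the signal and a quarter of the background, so the efficiency is 0.5 and the enhancement is 2.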

6.2.4  Network  Cut-Off 

We now analyze an independent sample of events using the trained network as
a tool to help distinguish between signal and background. Figure 6.8 shows the
reconstructed Higgs mass for both the signal and background events that have
passed the “zero level” lepton trigger, the jet-pair selection with
$81 < M_{jj}(\mathrm{full}) < 101$ GeV, and the network cut-off (with $z_{cut} = 0.75$).
Now, there are about 1,000 Higgs events and 2,000 QCD background events per year
within 50 GeV of the true Higgs mass of 400 GeV. This corresponds to an overall
enhancement factor of about 100 (see Table 6.1) with an overall efficiency of about
10%. This shows that the signal and background are now comparable. Comparing the
reconstructed Higgs boson mass in Figure 6.2 with Figure 6.8 shows the added
enhancement the neural network provides.

[Figure 6.7 plot: “Enhancement vs Efficiency”, net 9-16-8-1 (305), 400 GeV Higgs in 15 TeV pp collisions. Vertical axis: enhancement; horizontal axis: efficiency. Curves: network with pile-up, network no pile-up, Fisher discriminates no pile-up.]

Figure 6.7: Shows the enhancement versus the efficiency for the training sample
of events for the 9-16-8-1 neural network with 305 memory parameters. Both the
ideal case where only one event at a time enters the detector (no pile-up) and
the case of multiple interactions per beam crossing (pile-up) are shown. Each point
in the plot corresponds to a different choice for the network cut-off with the lower
efficiency and higher enhancements corresponding to larger values of $z_{cut}$. The
network enhancements are compared with the enhancements arrived at by the use
of Fisher discriminates (no pile-up).


[Figure 6.8 plot: “Reconstructed Higgs Mass”, 400 GeV Higgs in 15 TeV pp collisions, $81 < M_{jj} < 101$ GeV, no pile-up, after network cut. Vertical axis: events per year in 25 GeV bin; horizontal axis: mass (GeV).]

Figure 6.8: Shows the reconstructed mass of a 400 GeV Higgs boson produced in
15 TeV p-p collisions. The plot corresponds to the number of events per year (with
Lum = $10^5$/pb) in a 25 GeV bin for the $H \to ZZ$ signal and the Z+jets background
for the ideal case where only one event at a time enters the detector (no pile-up).
The events have survived the “zero-level” lepton trigger and the jet-pair selection
criterion with $81 < M_{jj}(\mathrm{full}) < 101$ GeV and have passed the network cut-off
(i.e., have $z_{net} > 0.75$).


6.2.5  Network  Weighting 

An alternative approach to using the network cut-off is to use network weighting.
Here one weights the event with the network response, $z_{net}$, which lies between
zero and one. If the network has been able to separate signal from background
then signal events will be assigned a weight near one and background events will
be assigned a weight near zero.

Figure 6.9 shows the network weighted reconstructed Higgs mass for both the
signal and background events that have passed the lepton cuts and the jet-pair
selection with $81 < M_{jj}(\mathrm{full}) < 101$ GeV. The advantage here is that all the
signal events are used (i.e., the relative efficiency is 100%), but in this case the
network cut-off procedure provides a better enhancement of the signal.

[Figure 6.9 plot: “Reconstructed Higgs Mass”, 400 GeV Higgs in 15 TeV pp collisions, $81 < M_{jj} < 101$ GeV, no pile-up, network weighted. Vertical axis: events per year in 25 GeV bin; horizontal axis: mass (GeV). Curves: Higgs→ZZ signal and Z+jets background.]

Figure 6.9: Shows the reconstructed mass of a 400 GeV Higgs boson produced in
15 TeV p-p collisions weighted by the network output, $z_{net}$. The plot corresponds
to the weighted number of events per year (with Lum = $10^5$/pb) in a 25 GeV bin for
the $H \to ZZ$ signal and the Z+jets background for the ideal case where only one
event at a time enters the detector (no pile-up). The events have survived the
“zero-level” lepton trigger and the jet-pair selection criterion with
$81 < M_{jj}(\mathrm{full}) < 101$ GeV.
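The difference between the two procedures can be made concrete with a small histogramming sketch. The mass range of 300 to 500 GeV, the function name, and the event format are assumptions made for illustration; the 25 GeV bin width is the one used in the figures.

```python
def mass_histogram(events, lo=300.0, hi=500.0, width=25.0, z_cut=None):
    """Histogram of the reconstructed Higgs mass.

    events: list of (reconstructed_mass, z_net).
    z_cut=None  -> network weighting: each event enters with weight z_net.
    z_cut=value -> network cut-off: events with z_net > z_cut enter with weight 1.
    """
    n_bins = int(round((hi - lo) / width))
    hist = [0.0] * n_bins
    for m, z_net in events:
        if not lo <= m < hi:
            continue                      # outside the histogram range
        if z_cut is None:
            weight = z_net
        else:
            weight = 1.0 if z_net > z_cut else 0.0
        hist[int((m - lo) // width)] += weight
    return hist
```

Calling `mass_histogram(events)` gives the weighted distribution, while `mass_histogram(events, z_cut=0.75)` gives the distribution after the network cut-off.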


6.3  Fisher  Discriminates  Analysis  Without  Pile-Up 

As  already  discussed  in  Chapter  3,  Fisher  discriminates  are  another  method 
for  separating  signal  and  background.  Structurally,  Fisher  discriminates  can  be 
thought  of  as  a neural  network  with  one  single  neuron  and  the  identity  activation 
function.  The  performance  measure,  however,  attempts  to  separate  the  signal  and 
background  distributions. 




In  this  case,  training  consists  of  calculating  the  Fisher  coefficients.  Once  this 
is done the situation is similar to the network. For each input of $N_{in}$ variables
there  is  one  output  F.  We  have  determined  the  Fisher  coefficients  for  the  same 
sample  of  signal  and  background  events  used  to  train  our  network  and  the  Fisher 
response  for  these  events  is  shown  in  Figure  6.10.  The  separation  between  signal 
and  background  is  not  as  good  as  with  the  network.  This  shows  that  the  application 
of  neural  networks  in  this  analysis  improves  the  final  result  and,  therefore,  it  is 
justified  to  add  this  additional  complexity. 
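For orientation, the calculation of Fisher coefficients can be sketched as follows. The definition used in this work is the one given in Chapter 3; the sketch assumes the standard Fisher linear discriminant, in which the coefficients solve $S_W w = \mu_{sig} - \mu_{bak}$ with $S_W$ the summed within-class scatter matrix. All function names are hypothetical, and the normalization of the response may differ from the one used for Figure 6.10.

```python
def mean(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def scatter(rows, mu):
    """Scatter matrix: sum over events of (x - mu)(x - mu)^T."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            for j in range(d):
                s[i][j] += (r[i] - mu[i]) * (r[j] - mu[j])
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def fisher_coefficients(signal, background):
    """Coefficients w solving S_W w = mu_sig - mu_bak (standard form, assumed)."""
    mu_s, mu_b = mean(signal), mean(background)
    s_s, s_b = scatter(signal, mu_s), scatter(background, mu_b)
    d = len(mu_s)
    s_w = [[s_s[i][j] + s_b[i][j] for j in range(d)] for i in range(d)]
    return solve(s_w, [a - b for a, b in zip(mu_s, mu_b)])


def fisher_response(w, x):
    """Single linear output F for one input vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))
```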

As  with  the  network,  the  overall  Fisher  performance  can  be  characterized  by  the 
single  curve  of  the  Fisher  enhancement  versus  the  Fisher  efficiency  which  is  shown 
in  Figure  6.7  together  with  the  network  performance.  Each  point  corresponds  to  a 
different  choice  for  the  Fisher  cut-off.  The  best  that  can  be  done  with  the  Fisher 
method  is  an  enhancement  of  about  2,  whereas  the  network  enhancements  are 
much  higher. 


6.4  Data  Generation  and  Cuts  With  Pile-Up 

We now consider the case of multiple interactions per beam crossing. ISAJET
is used to generate $N_{pile}$ minimum bias events along with each Higgs $\to ZZ$ signal
and each Z+jets background event. The number of pile-up interactions per beam
crossing, $N_{pile}$, that enter the calorimeter is generated according to a Poisson
distribution with a mean of about 29 minimum bias collisions for each Higgs boson
or Z+jets event as shown in Figure 6.11. The mean of 29 collisions per beam
crossing was arrived at by using a bunch crossing time of 25 ns, a peak luminosity
of $10^{34}$ cm$^{-2}$ sec$^{-1}$, and the ISAJET minimum bias cross section at 15 TeV of
116 mb. Our mean number is slightly larger than the 20 collisions per beam crossing
quoted for the LHC.

Figure 6.10: Shows the Fisher response, F, for the sample of signal and background
events used in the training of the neural network. The plot corresponds to the
percentage of events with F within a 0.3 bin for the $H \to ZZ$ signal and the
Z+jets background for the ideal case where only one event at a time enters the
detector (no pile-up). The events have survived the “zero-level” lepton trigger and
the jet-pair selection criterion and have $81 < M_{jj}(\mathrm{full}) < 101$ GeV.
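The quoted mean of 29 collisions per crossing can be checked with a few lines of arithmetic, and the Poisson sampling can be sketched at the same time. The peak luminosity is taken to be $10^{34}$ cm$^{-2}$ sec$^{-1}$, which is the value that reproduces the quoted mean from the 25 ns crossing time and the 116 mb cross section. The sampler is a generic textbook method (Knuth's multiplication algorithm), not the ISAJET procedure, and its name is hypothetical.

```python
import math
import random

LUMINOSITY = 1.0e34        # cm^-2 s^-1, peak luminosity (assumed exponent)
SIGMA_MB = 116.0           # minimum bias cross section in mb
MB_TO_CM2 = 1.0e-27        # 1 mb = 1e-27 cm^2
BUNCH_SPACING = 25.0e-9    # s, bunch crossing time

# Mean number of minimum bias collisions per beam crossing.
MEAN_PILEUP = LUMINOSITY * SIGMA_MB * MB_TO_CM2 * BUNCH_SPACING


def sample_poisson(mean, rng):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k
```

Drawing one value of `sample_poisson(MEAN_PILEUP, rng)` per signal or background event gives the number of minimum bias events to overlay on it.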

These pile-up interactions greatly increase the particle multiplicity and the
global transverse energy of each event. Nevertheless, they do not affect the lepton
trigger. Table 6.2 shows that, as before, roughly 10,000 Higgs boson and about 2
million background events per year pass the “zero level” lepton trigger.

Events are again analyzed by dividing the solid angle into “calorimeter” cells
having size $\Delta\eta\,\Delta\phi = 0.2 \times 15°$, but in this case we ignore all cells with
$E_T < 1$ GeV. This reduces the number of non-zero cells, which saves time and
improves the jet algorithm. Jets are defined as before, but the definition of a “hot” cell is




Figure 6.11: Generated number of minimum bias interactions per beam crossing.
These events enter the calorimeter together with one $H \to ZZ$ signal event or one
Z+jets background event to simulate the case of multiple interactions per beam
crossing (pile-up).

changed to 10 GeV. This means that the minimum jet transverse energy is now
10 GeV (compared to 5 GeV in the analysis without pile-up).

Except for these changes, the jet-pair selection is done as before with similar
results. Table 6.2 shows that of the 10,000 signal events passing the “zero-level”
lepton cut about 50% also pass the jet-pair selection criterion. Also, about 30% of
the ordinary Z+jets background events that survive the “zero-level” lepton trigger
have a jet-pair that meets the selection criterion.

The jet-jet invariant mass for the signal and background events that have passed
the “zero-level” lepton trigger and the jet-pair selection criterion is shown in Figure
6.1. Comparison with the no pile-up case shows that the Z mass peak has shifted
up about 20 GeV and become somewhat broader. This is, of course, due to the




Table 6.2: 400 GeV Higgs boson produced in 15 TeV p-p collisions. The table
shows the number of events per year (with Lum = $10^5$/pb) for the $H \to ZZ$ signal
and Z+jets background for the case of multiple interactions per beam crossing
(i.e., with pile-up). The “zero-level” lepton trigger is used as a reference point and
is normalized to 100%. The enhancement factor is defined to be the percentage of
signal divided by the percentage of background surviving the given set of cuts.


                          Signal H -> ZZ        Background Z+jets     Back/   Enhancement Factor
Selection Cut             % Overall  Events/    % Overall  Events/    Sig     Relative  Overall
                                     year                  year
------------------------------------------------------------------------------------------------
Lepton trigger            100%       10212      100%       1973919    193     1.0       1.0
  P_T(l) > 25 GeV
  P_T(ll) > 100 GeV

Jet pair selection        53.3%      5440       33.6%      662850     122     1.6       1.6
  E_T(j) > 25 GeV
  > 100 GeV

Z-mass cut                19.3%      1973       2.3%       44693      23      5.4       8.5
  81 < M_Z < 101 GeV

H-mass cut                14.6%      1489       0.7%       13615      9.1     2.5       21.1
  350 < M_H < 450 GeV

Z-mass & net cut          6.8%       696        0.2%       3230       4.6     4.9       41.7
  81 < M_Z < 101 GeV
  net cut > 0.75

H-mass & net cut          5.6%       568        0.1%       1525       2.7     3.4       72.0
  350 < M_H < 450 GeV
  net cut > 0.75

pile-up  interactions  which  have  contributed  transverse  energy  and  mass  to  the  jet- 
pair.  Rather  than  trying  to  subtract  out  this  effect,  we  simply  shift  our  jet-jet 
mass  cut  to 


$100 < M_{jj}$(full) $< 120$ GeV, (6.15)


where $M_{jj}$(full) is defined as before with $R_{jj}$(full) = 0.6. As before, the invariant
mass of the jet-pair is used only in the selection of events; the Higgs mass is
reconstructed from the momentum of the jet-pair with $M_{jj}$ set equal to $M_Z$. As




Figure 6.12: Shows the multiplicity of jets for 400 GeV Higgs bosons produced in
15 TeV p-p collisions. The plot corresponds to the percentage of events with N jets
with $E_T$ greater than 10 GeV for the $H \to ZZ$ signal and the Z+jets background
for the case of multiple interactions per beam crossing (pile-up). The events have
survived the “zero-level” lepton trigger and the jet-pair selection criterion with
$100 < M_{jj}$(full) $< 120$ GeV.

can be seen from Table 6.2, in this case about 36% of the Higgs boson signal passing
both the “zero-level” lepton cut and the jet-pair selection criterion have $M_{jj}$ within
this range, which is slightly less than the 51% for the no pile-up case. About 7% of
the Z+jets background events surviving both the “zero-level” lepton cut and the
jet-pair selection criterion have a full jet-pair invariant mass in this range, which is
about the same as the no pile-up case. This corresponds to an overall enhancement
factor at this stage of about 8 with an overall efficiency of about 19%, which is
slightly worse than the no pile-up case.


At this stage, Table 6.2 shows that there are about 1,500 Higgs events and
14,000 background events per year within 50 GeV of the true Higgs mass that




pass the “zero-level” lepton trigger, the jet-pair selection criterion, and have
$100 < M_{jj}$(full) $< 120$ GeV. This corresponds to an overall enhancement factor of about
21 with an overall efficiency of about 15%. With this enhancement, the Z+jets
background is roughly 9 times the signal. At this stage, we apply a neural network
to improve the signal to background ratio beyond what can be achieved with these
standard cuts.
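
The enhancement and efficiency figures quoted above can be recomputed directly
from the event counts in Table 6.2. The short Python sketch below does this for
the H-mass cut with and without the network cut, using the definition given in
the table caption (fraction of signal surviving divided by fraction of
background surviving). The function and variable names are illustrative
assumptions.

```python
def efficiency(signal_before, signal_after):
    """Fraction of signal events that survive a cut."""
    return signal_after / signal_before


def enhancement(signal_before, signal_after, background_before, background_after):
    """Surviving signal fraction divided by surviving background fraction."""
    signal_fraction = signal_after / signal_before
    background_fraction = background_after / background_before
    return signal_fraction / background_fraction


# Events per year from Table 6.2: (signal, background)
lepton_trigger = (10212, 1973919)   # reference point, 100%
h_mass_cut = (1489, 13615)
h_mass_and_net_cut = (568, 1525)

overall_h_mass = enhancement(lepton_trigger[0], h_mass_cut[0],
                             lepton_trigger[1], h_mass_cut[1])
overall_h_mass_net = enhancement(lepton_trigger[0], h_mass_and_net_cut[0],
                                 lepton_trigger[1], h_mass_and_net_cut[1])
relative_net = overall_h_mass_net / overall_h_mass
efficiency_h_mass = efficiency(lepton_trigger[0], h_mass_cut[0])

print(f"overall enhancement after H-mass cut:       {overall_h_mass:.1f}")
print(f"overall enhancement after H-mass & net cut: {overall_h_mass_net:.1f}")
print(f"relative enhancement from the network cut:  {relative_net:.1f}")
print(f"overall efficiency after H-mass cut:        {efficiency_h_mass:.1%}")
```

The relative enhancement is obtained here as the ratio of two overall
enhancements, which is equivalent to applying the same definition between
consecutive rows of the table.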


6.5  Network  Analysis  With  Pile-Up 

We  use  the  same  nine  variables  to  characterize  the  events,  but  since  these 
variables  have  changed  dramatically,  the  network  must  be  retrained.  Figure  6.12 
shows  the  new  jet  multiplicities.  Figure  6.13  and  Figure  6.14  show  that  the  fraction 
of  transverse  energy  and  mass,  respectively,  originating  in  the  extended  region, 
0.6  < Rjj  < 1.0,  has  greatly  increased  for  both  the  signal  and  background  events 
due  to  the  pile-up.  Nevertheless,  there  are  still  slight  differences  between  signal 
and  background  that  the  network  can  use  to  distinguish  between  the  two. 

The 9-16-8-1 network is retrained on a sample of 2,741 signal and 3,566 background
events that include the pile-up interactions. Both signal and background
events have already satisfied the “zero-level” lepton cuts, the jet-pair selection,
and have $100 < M_{jj}$(full) $< 120$ GeV. To get this training sample it was necessary
to generate 40,000 Higgs boson events with pile-up and 400,000 Z+jets events
with pile-up. Running with pile-up is a lot slower since a large number of events
enter the calorimeter during each beam crossing. Because of this we are using a
very small training sample. We could do better with a larger sample, but this is
sufficient for what we want to illustrate in this work. After training, we achieve




Figure 6.13: Shows the fraction of transverse energy coming from calorimeter cells
within the “halo” region $0.6 < R_{jj} < 1.0$ surrounding either of the away-side jets.
The plot corresponds to the percentage of events with the jet-jet transverse energy
fraction within the 0.025 bin for the $H \to ZZ$ signal and the Z+jets background
for the case of multiple interactions per beam crossing (pile-up). The events have
survived the “zero-level” lepton trigger and the jet-pair selection criterion and have
$100 < M_{jj}$(full) $< 120$ GeV.

E = 0.1797  with  a network  response  for  the  training  events  shown  in  Figure  6.15. 
Figure  6.15  also  shows  the  network  response  (i.e.,  Znet)  for  an  independent  sample 
of  signal  and  background  events  not  used  in  the  training.  In  spite  of  the  small 
training  sample,  the  network  generalizes  fairly  well. 

The network performance for the training sample is shown in Figure 6.7 together
with the no pile-up case. Again we choose a network cut-off, $Z_{cut}$, of 0.75,
which in this case for the training sample corresponds to a relative enhancement
of about 6 with a relative efficiency of about 38%.

As  in  the  case  without  pile-up,  we  analyze  an  independent  sample  of  signal  and 
background  events  with  pile-up.  Figure  6.16  shows  the  reconstructed  Higgs  mass 




Figure 6.14: Shows the fraction of invariant mass coming from calorimeter cells
within the “halo” region $0.6 < R_{jj} < 1.0$ surrounding either of the away-side jets.
The plot corresponds to the percentage of events with the jet-jet invariant mass
fraction within the 0.05 bin for the $H \to ZZ$ signal and the Z+jets background
for the case of multiple interactions per beam crossing (pile-up). The events have
survived the “zero-level” lepton trigger and the jet-pair selection criterion and have
$100 < M_{jj}$(full) $< 120$ GeV.

for both the signal and background events that have passed the lepton cuts, the
jet-pair selection with $100 < M_{jj}$(full) $< 120$ GeV, and the network cut-off (with
$Z_{cut} = 0.75$). Now, there are about 600 Higgs events and 1,500 QCD background events
per year within 50 GeV of the true Higgs mass of 400 GeV. This corresponds to an
overall enhancement factor of about 72 (see Table 6.2) with an overall efficiency of
about 6%. Although the results are not quite as good as the no pile-up case, signal
and background are again roughly comparable and the network has improved the
signal to background ratio by about a factor of 4.




Figure 6.15: Shows the network response, $Z_{net}$, for the sample of signal and
background events used in the training and for an independent sample of signal
and background events. The plot corresponds to the percentage of events with
$Z_{net}$ within a 0.05 bin for the $H \to ZZ$ signal and Z+jets background for the
case of multiple interactions per beam crossing (pile-up). The events have survived
the “zero-level” lepton trigger and the jet-pair selection criterion and have
$100 < M_{jj}$(full) $< 120$ GeV.


6.6  Summary 


This analysis shows that neural networks are useful tools in Higgs boson phenomenology.
They can further enhance the signal in a way that cannot be achieved
with traditional methods such as Fisher discriminants. We were able
to obtain an overall signal to background enhancement of around 10 with the
standard Higgs boson cuts. The neural network, however, provides an additional
enhancement of 4-5 beyond what can be achieved with the standard data cuts,
resulting in an overall enhancement of about 50.




Figure 6.16: Shows the reconstructed mass of a 400 GeV Higgs boson produced in
15 TeV p-p collisions. The plot corresponds to the number of events per year (with
Lum = $10^5$/pb) in a 25 GeV bin for the $H \to ZZ$ signal and the Z+jets background
for the case of multiple interactions per beam crossing (pile-up). The events have
survived the “zero-level” lepton trigger and the jet-pair selection criterion with
$100 < M_{jj}$(full) $< 120$ GeV and have passed the network cut-off (i.e., have
$Z_{net} > 0.75$).

In addition, this method works even with a large number of interactions per
beam crossing. This shows that some jet physics can be done even in the large
pile-up environment of the LHC. Although this work is not a detailed simulation,
experiments at the LHC should be able to do as well as or better than our analysis.
Furthermore, this procedure can be applied to W bosons and should help enhance
the Higgs $\to WW \to \ell\nu jj$ signal at hadron colliders as well.


CHAPTER  7 

THE  FOUR-JET  DECAY  MODE  OF  TOP 


The top quark, as predicted by the Standard Model, is the weak-isospin partner
of the bottom quark. It is the heaviest quark with a mass of about $m_t = 175$ GeV
and for that reason it has been discovered only recently by the CDF [30, 31] and
D0 [32, 33] collaborations.

The top quark decays into a b-quark and a W boson, $t \to bW$. The W boson
decays into a lepton (e or $\mu$) and a neutrino about 22% (2/9) of the time and
into a quark-antiquark pair about 67% (6/9) of the time. This implies that when
top-pairs are produced in hadron-hadron collisions, $p\bar{p} \to t\bar{t} + X$, both of the W
bosons decay into a lepton and neutrino only about 5% of the time, resulting in
the final state consisting of two leptons, two neutrinos, and two b-quarks ($\ell\nu\ell\nu b\bar{b}$).
This distinctive final state constitutes the “discovery” mode of the top quark at
hadron colliders [30, 33]. On the other hand, it is considerably more likely for one
of the W bosons to decay into a quark-antiquark pair, resulting in a final state
consisting of a lepton, a neutrino, and a $b\bar{b}$ and a $q\bar{q}$ pair. The $\ell\nu b\bar{b}q\bar{q}$ mode shown
in Figure 7.1 occurs about 35% of the time or about 7 times more often than
the purely leptonic mode. The backgrounds are larger for this decay mode, but
so is the signal. When each of the four outgoing quarks produces a distinct jet,
the resulting event contains a lepton, a neutrino, and four jets ($\ell\nu jjjj$). This
decay mode is used to analyze the properties of the top quark in more detail and
to determine, for example, the top mass. The purely hadronic six-jet decay mode




Figure 7.1: Illustration of top-pair production in p-p collisions in which one of the
W bosons decays leptonically and the other decays hadronically, resulting in a final
state consisting of a lepton, a neutrino, a $b\bar{b}$ pair, and a $q\bar{q}$ pair.




occurs  about  60%  of  the  time,  but  it  is  completely  buried  underneath  “ordinary” 
QCD  multijet  production. 

In this chapter, we concentrate on the $\ell\nu b\bar{b}q\bar{q}$ decay mode of top-pair production
in proton-antiproton collisions at 1.8 TeV and investigate ways to optimize this
signal over the backgrounds. The event topology of the signal is shown in Figure
7.2 and consists of a lepton, a neutrino, and four outgoing quarks which manifest


Figure  7.2:  Shows  the  event  topology  for  the  top-pair  signal.  If  each  outgoing 
quark  produces  a distinct  jet  then  the  final  state  contains  a lepton,  a neutrino 
(missing  Et),  and  four  jets. 

themselves as “jets”. In the center-of-mass frame of a 175 GeV top quark, the W boson
and b-quark decay products each have a momentum of about 70 GeV. Furthermore,
in the center-of-mass frame of the W boson, the quark and antiquark decay products
each have a momentum of about 40 GeV. The top pairs are produced near threshold,
resulting in a typical event that is rather spherical in shape with all six decay
products, $\ell\nu b\bar{b}q\bar{q}$, having large transverse energy. The background comes from




Figure 7.3: Illustrates a W+jets background process to top-pair production
in $p\bar{p}$ collisions.

the “ordinary” QCD production of large transverse momentum W bosons plus
multiple jets as shown in Figure 7.3 and from the production of b-quark pairs plus
associated jets as illustrated in Figure 7.4.
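
The momenta of about 70 GeV and about 40 GeV quoted for the decay products
follow from standard two-body decay kinematics. The Python sketch below
evaluates the rest-frame momentum for $t \to bW$ and for a W decaying to two
massless quarks. The W mass of 80.4 GeV and the b-quark mass of 4.5 GeV are
assumed values introduced here for illustration; only the 175 GeV top mass is
taken from the text.

```python
import math


def two_body_momentum(parent_mass, m1, m2):
    """Momentum of either daughter in the parent rest frame (units of the masses)."""
    sum_term = parent_mass**2 - (m1 + m2) ** 2
    diff_term = parent_mass**2 - (m1 - m2) ** 2
    return math.sqrt(sum_term * diff_term) / (2.0 * parent_mass)


M_TOP = 175.0  # GeV, value used in the text
M_W = 80.4     # GeV, assumed for this illustration
M_B = 4.5      # GeV, assumed for this illustration

p_in_top_frame = two_body_momentum(M_TOP, M_W, M_B)
p_in_w_frame = two_body_momentum(M_W, 0.0, 0.0)  # massless quarks: M_W / 2

print(f"t -> bW:    p = {p_in_top_frame:.1f} GeV")
print(f"W -> q qbar: p = {p_in_w_frame:.1f} GeV")
```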

Our analysis is an extension of the Higgs analysis presented in the previous
chapter. We still apply the standard lepton plus missing transverse energy cuts
since the signal events have a leptonic component. However, we will also use
the modified Fox-Wolfram moments defined in Chapter 3 to describe the overall
topology of signal and background events. We will then apply neural networks and
Fisher discriminants [7, 34, 35] directly to the modified Fox-Wolfram coefficients.
This obviates the need to select some specific input variables that may or may not
be sufficient to achieve “maximum” separation between signal and background.




Figure 7.4: Illustrates a $b\bar{b}$+jets background process to top-pair production
in $p\bar{p}$ collisions.

7.1  Data  Generation  and  Cuts 
7.1.1  Event  Generation 

ISAJET version 7.06 is used to generate top quarks with a mass of 175 GeV
in 1.8 TeV proton-antiproton collisions. At this energy, 175 GeV top-pairs are produced
via quark-antiquark annihilation, $q\bar{q} \to t\bar{t}$, about 88% of the time and by
gluon-gluon fusion, $gg \to t\bar{t}$, the remaining 12%. We refer to this as the “signal”.
We have normalized the top cross section to be 7.5 pb, corresponding to 750 events
with an integrated luminosity of 100/pb. The “background” consists of single W
boson events generated with the hard-scattering transverse momentum, $k_T$, greater
than 25 GeV. Single W bosons are produced at large transverse momentum via the
“ordinary” QCD subprocesses $qg \to Wq$, $\bar{q}g \to W\bar{q}$, and $q\bar{q} \to Wg$. These subprocesses,
of course, generate additional gluons via bremsstrahlung off both incident
and outgoing color non-singlet partons, resulting in multiparton final states which
subsequently fragment into hadrons. This is referred to as the W+jets background.
Another background is b-quark pairs produced via the subprocess $q\bar{q} \to b\bar{b}$ and the
accompanying radiation. This is referred to as the $b\bar{b}$+jets background.

We do not attempt to do a detailed simulation of the CDF or D0 detector.
Events are analyzed by dividing the solid angle into “calorimeter” cells having size
$\Delta\eta\,\Delta\phi = 0.1 \times 7.5°$, where $\eta$ and $\phi$ are the pseudorapidity and azimuthal angle,
respectively. Our simple calorimeter covers the range $|\eta| < 4$ and has 3840 cells.
A single cell has an energy (the sum of the energies of all the particles that hit the
cell excluding neutrinos) and a direction given by the coordinates of the center of
the cell. The transverse energy of each cell is then computed from the cell
energy and direction. We have taken the energy resolution to be perfect, which
means that the only resolution effects are caused by the lack of spatial resolution
due to the cell size. Large transverse momentum leptons are analyzed separately
and are not included when computing the energy of a cell.
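
The simple calorimeter described above can be sketched in a few lines of Python.
The grid below has 80 bins in pseudorapidity and 48 bins in azimuth, which gives
the 3840 cells quoted in the text, and it converts the summed cell energy to
transverse energy using the direction of the cell center. The function names,
the particle format (energy, pseudorapidity, azimuth in degrees), and the
convention that the caller has already removed neutrinos and high transverse
momentum leptons are assumptions made for this illustration.

```python
import math

ETA_MAX = 4.0
D_ETA = 0.1
D_PHI_DEG = 7.5
N_ETA = round(2 * ETA_MAX / D_ETA)    # 80 bins in pseudorapidity
N_PHI = round(360.0 / D_PHI_DEG)      # 48 bins in azimuth
N_CELLS = N_ETA * N_PHI


def cell_index(eta, phi_deg):
    """Return (i_eta, i_phi) for a particle, or None if it is outside |eta| < 4."""
    if abs(eta) >= ETA_MAX:
        return None
    i_eta = min(int((eta + ETA_MAX) / D_ETA), N_ETA - 1)
    i_phi = int((phi_deg % 360.0) / D_PHI_DEG) % N_PHI
    return i_eta, i_phi


def cell_center(i_eta, i_phi):
    """Pseudorapidity and azimuth (degrees) of the center of a cell."""
    return -ETA_MAX + (i_eta + 0.5) * D_ETA, (i_phi + 0.5) * D_PHI_DEG


def fill_grid(particles):
    """Sum particle energies per cell and return {cell: transverse energy}.

    particles: iterable of (energy, eta, phi_deg), with neutrinos and
    separately analyzed leptons already removed by the caller.
    """
    energy = {}
    for e, eta, phi_deg in particles:
        idx = cell_index(eta, phi_deg)
        if idx is not None:
            energy[idx] = energy.get(idx, 0.0) + e
    # E_T = E sin(theta) = E / cosh(eta), evaluated at the cell center
    return {idx: e / math.cosh(cell_center(*idx)[0]) for idx, e in energy.items()}


grid = fill_grid([(10.0, 0.05, 3.0), (5.0, 0.02, 1.0), (50.0, 4.5, 10.0)])
print(N_CELLS, grid)
```

In this usage example the first two particles fall in the same cell and the
third lies outside the calorimeter coverage, so the grid contains one cell.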


7.1.2  Lepton  Plus  Missing  Transverse  Energy  Trigger 

The “zero-level” trigger is designed to select large transverse momentum W
bosons that have decayed into a charged lepton and a neutrino. This first cut is
made by demanding that the event contain at least one isolated high transverse
momentum charged lepton ($e^\pm$ or $\mu^\pm$) in the central region satisfying:


$P_T(\ell^\pm) > 15$ GeV, $|\eta(\ell^\pm)| < 2.5$. (7.1)




“Isolated” leptons are defined by demanding that the total transverse energy
within a distance $R_\ell$ of the lepton in $\eta$-$\phi$ space be less than $E_T$(max). For this
analysis,

$R_\ell = 0.2$, $E_T$(max) = 5 GeV. (7.2)

In addition, the event must have large missing transverse energy, $\not{E}_T$, and an
overall lepton-neutrino transverse momentum, $P_T(\ell\nu)$, given by

$\not{E}_T > 20$ GeV, $P_T(\ell\nu) > 25$ GeV (7.3)

where the missing transverse momentum 2-vector, $\not{\vec{p}}_T$, is determined from the
transverse energy grid (i.e., the calorimeter) and

$\not{E}_T^2 = \not{p}_x^2 + \not{p}_y^2$ (7.4)

$P_T^2(\ell\nu) = (p_x^\ell + \not{p}_x)^2 + (p_y^\ell + \not{p}_y)^2$ (7.5)

where the x-axis and y-axis are perpendicular to the colliding beams and the z-axis
is parallel.
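
The trigger conditions of this section can be collected into a single test, as in
the Python sketch below. It assumes that the missing transverse energy is the
magnitude of the missing transverse momentum 2-vector and that the isolation
sum (the transverse energy within the isolation radius of the lepton) has
already been computed from the calorimeter grid. The function name, argument
layout, and default thresholds as keyword arguments are illustrative choices.

```python
import math


def passes_lepton_trigger(lep_px, lep_py, lep_eta, miss_px, miss_py, isolation_et,
                          lep_pt_min=15.0, eta_max=2.5, isolation_max=5.0,
                          miss_et_min=20.0, pt_lnu_min=25.0):
    """Lepton plus missing E_T trigger; momenta and energies are in GeV."""
    lep_pt = math.hypot(lep_px, lep_py)
    miss_et = math.hypot(miss_px, miss_py)  # assumed: |missing p_T vector|
    pt_lnu = math.hypot(lep_px + miss_px, lep_py + miss_py)
    return (lep_pt > lep_pt_min
            and abs(lep_eta) < eta_max
            and isolation_et < isolation_max
            and miss_et > miss_et_min
            and pt_lnu > pt_lnu_min)


# Lepton and missing momentum roughly aligned: large overall P_T, passes.
aligned = passes_lepton_trigger(30.0, 0.0, 1.0, 25.0, 10.0, 2.0)
# Lepton and missing momentum back to back: overall P_T is only 5 GeV, fails.
back_to_back = passes_lepton_trigger(30.0, 0.0, 1.0, -25.0, 0.0, 2.0)
print(aligned, back_to_back)
```

The second example shows why the overall lepton-neutrino transverse momentum
cut matters: both the lepton and the missing energy are large, but they nearly
cancel, which is typical of a W boson produced with little transverse momentum.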

This selection of $P_T(\ell^\pm) > 15$ GeV, $\not{E}_T > 20$ GeV, and $P_T(\ell\nu) > 25$ GeV is
referred to as the lepton plus missing $E_T$ trigger. Table 7.1 shows that about 165
top-pair events survive this selection criterion (22% of the total top signal). For
illustration, we take the integrated luminosity to be 100/pb. Table 7.1 also shows
that about 7,000 W+jets and 500 $b\bar{b}$+jets background events also survive this cut.
The lepton isolation cuts do a good job removing most of the $b\bar{b}$+jets background,




Table 7.1: 175 GeV top quark pairs produced in 1.8 TeV p-p collisions. The table
shows the number of events (with Lum = 100/pb) for the top-pair signal and the
W+jets background.


                          Signal Top (175 GeV)   Background W+jets     sig/     Enhancement Factor
Selection Cut             % Remain   Events      % Remain   Events     back     Relative  Overall
--------------------------------------------------------------------------------------------------
P_T(l) > 15 GeV           100%       165         100%       7044       0.0234   1.0       1.0
E_T(miss) > 20 GeV
P_T(l nu) > 25 GeV

N(cell) > 7               69.0%      113         0.7%       49         2.3      99.7      99.7
E_T > 5 GeV
E_T(sum) > 100 GeV

Fisher cut                30%        49          0.1%       6          8.7      3.7       373.0
F > 0.75 GeV

so we will concentrate primarily on the W+jets background. At this stage, the
background is about 43 times the signal.

In order to quantify how various additional cuts enhance the signal above the
background, we use the enhancement, $F_{enh}$, and efficiency, $F_{eff}$, defined in Chapter
3.

We also define the “zero-level” trigger to be the reference point and the fraction
of events surviving this cut is set to 100% in Table 7.1. Similarly, all “enhancement”
factors are set to one at this level as we measure the effectiveness of all
other additional cuts from this point. The overall enhancement and efficiency are
determined by examining the number of events before and after the particular cut.


7.1.3  Calorimeter  Cell  Cuts 

At  this  stage  in  the  analysis  one  normally  demands  that  the  event  contain  at 
least  four  jets[32,  31].  Cutting  on  the  number  of  jets  is  a way  to  preferentially 
select  the  top-pair  signal  over  the  background.  However,  we  have  found  that  it 




Figure 7.5: Shows the multiplicity of calorimeter cells containing at least 5 GeV of
transverse energy for the top-pair signal and the W+jets background. In all cases
the events have survived the “zero-level” lepton plus missing $E_T$ trigger. The plot
shows the percentage of events with N cells with $E_T$(cell) > 5 GeV. The position
of our cell cut is marked by the dotted line.

is faster and better to simply cut on the number of calorimeter cells, $N_{cell}$, with
transverse energy greater than some minimum, $E_T^{cell}$(min). Figure 7.5 shows the
cell multiplicity with $E_T^{cell}$(min) = 5 GeV for the top-pair signal and the W+jets
background. On the average, the top-pair signal populates a larger number of cells
than does the background. Obviously this is because the top-pair signal produces
more jets. However, one does not have to define a “jet” to see this topology. The
top-pair signal produces transverse energy flying out in all directions and this can
be seen directly from the calorimeter cell multiplicity.

The top-pair signal also produces more global transverse energy than the background.
This is shown in Figure 7.6, where we define $E_T$(sum) to be the sum of




Figure 7.6: Shows the total transverse energy of all the calorimeter cells with
$E_T$(cell) > 5 GeV for the top-pair signal and the W+jets background. In all cases
the events have survived the “zero-level” lepton plus missing $E_T$ trigger. The plot
shows the percentage of events with $E_T$(sum) within a 25 GeV bin. The position
of our cell cut is marked by the dotted line.

the transverse energy of all the calorimeter cells with $E_T > E_T^{cell}$(min). As shown
in Figure 7.5 and Figure 7.6, we make the following calorimeter cell cuts:


$N_{cell} \ge 8$ with $E_T^{cell}$(min) = 5 GeV and $E_T$(sum) > 100 GeV. (7.6)


At this stage one could cut harder on $E_T$(sum) and remove more background.
However, we want to avoid as much as possible cuts that cause the background
invariant mass to peak at the same place as the top-pair signal. For this reason,
we will use event “shape” variables to further improve the signal to background
ratio.
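
The calorimeter cell cuts need only the list of cell transverse energies, with no
jet finding. A minimal Python sketch is given below. It assumes that the
multiplicity requirement is at least eight cells (the form N(cell) > 7 used in
Table 7.1) and that a cell counts only if its transverse energy is strictly
above the 5 GeV minimum; the function name and these boundary conventions are
assumptions.

```python
def passes_cell_cuts(cell_et, et_cell_min=5.0, n_cell_min=8, et_sum_min=100.0):
    """Calorimeter cell cuts: enough hot cells and enough summed transverse energy.

    cell_et: iterable of cell transverse energies in GeV.
    """
    hot_cells = [et for et in cell_et if et > et_cell_min]
    return len(hot_cells) >= n_cell_min and sum(hot_cells) > et_sum_min


# Eight hot cells carrying 160 GeV in total: passes both requirements.
busy_event = passes_cell_cuts([20.0] * 8 + [1.0] * 30)
# Many hot cells, but only 60 GeV in total: fails the E_T(sum) requirement.
soft_event = passes_cell_cuts([6.0] * 10)
# Large transverse energy concentrated in three cells: fails the multiplicity cut.
narrow_event = passes_cell_cuts([80.0, 60.0, 40.0])
print(busy_event, soft_event, narrow_event)
```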




Table 7.1 shows that of the 165 top-pair events passing the “zero-level” lepton
plus missing $E_T$ cut roughly 69% also pass the calorimeter cell cuts. On the other
hand, less than 1% of the W+jets background events survive the cell cuts. The
calorimeter cell cuts produce an enhancement of 0.69/0.007 or about 100 over the
W+jets background with 69% efficiency, resulting in a signal to background ratio
of about 2. The $N_{cell} \ge 8$ with $E_T^{cell}$(min) = 5 GeV cut produces more than a
factor of two better enhancement than the traditional “jet cuts” (i.e., $N_{jet} \ge 4$).
Adding the $E_T$(sum) > 100 GeV cut gives an additional relative enhancement of
more than a factor of 3.


7.2  Reconstructing  the  Top-Pair  Invariant  Mass 

7.2.1  Reconstructing  the  Neutrino  Momentum 

Ideally one would like to reconstruct the invariant mass of the top-pair from
its decay products: lepton, neutrino, and four jets. However, the neutrino is not
detected and its presence must be inferred by examining the missing transverse
momentum, $\not{\vec{p}}_T$. If we set the transverse momentum components of the neutrino
equal to the missing transverse momentum,

$\vec{p}_T^{\,\nu} = \not{\vec{p}}_T$, (7.7)

and assume that the charged lepton and the neutrino are the result of a W decay
(and neglect the W width) then the longitudinal momentum of the neutrino is




Figure 7.7: Shows the multiplicity of jets with transverse energy greater than 15
GeV for the top-pair signal and W+jets background. In all cases the events have
survived the “zero-level” lepton plus missing $E_T$ trigger and the calorimeter cell
cuts. The plot shows the percentage of events with N jets with $E_T$(jet) > 15 GeV.


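given by one of the two solutions:

$p_z^\nu = \frac{A\,p_z^\ell \pm E_\ell \sqrt{A^2 - 4 (p_T^\ell)^2 (p_T^\nu)^2}}{2 (p_T^\ell)^2}$, (7.8)

where $E_\ell$, $p_z^\ell$, and $p_T^\ell$ are the energy, longitudinal momentum, and transverse
momentum, respectively, of the charged lepton, and $p_T^\nu$ is the transverse momentum
of the neutrino. The quantity A is given by

$A = M_W^2 + 2\,\vec{p}_T^{\,\nu} \cdot \vec{p}_T^{\,\ell} = M_W^2 + 2\,p_T^\nu p_T^\ell \cos\phi$ (7.9)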




where $\phi$ is the azimuthal angle between the transverse momentum vectors of the
charged lepton and the neutrino. We include both solutions in our determination
of the top-pair invariant mass.
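
The two longitudinal solutions can be obtained by imposing the W mass on the
lepton-neutrino system and solving the resulting quadratic equation. The Python
sketch below does this for a massless charged lepton and a massless neutrino.
The W mass of 80.4 GeV, the function name, and the choice to return an empty
list when the quadratic has no real root are assumptions made for this
illustration; the text does not say how such events are treated.

```python
import math

M_W = 80.4  # GeV, assumed value for this illustration


def neutrino_pz(lep_px, lep_py, lep_pz, nu_px, nu_py, m_w=M_W):
    """Return the real solutions for the neutrino longitudinal momentum (GeV).

    Uses the constraint (p_lepton + p_neutrino)^2 = m_w^2 with massless
    particles and the neutrino transverse momentum set to the missing
    transverse momentum.
    """
    lep_pt2 = lep_px**2 + lep_py**2
    nu_pt2 = nu_px**2 + nu_py**2
    lep_e = math.sqrt(lep_pt2 + lep_pz**2)
    a = m_w**2 + 2.0 * (lep_px * nu_px + lep_py * nu_py)
    discriminant = a * a - 4.0 * lep_pt2 * nu_pt2
    if discriminant < 0.0:
        return []  # no real solution; handling of this case is an assumption
    root = lep_e * math.sqrt(discriminant)
    return [(a * lep_pz + root) / (2.0 * lep_pt2),
            (a * lep_pz - root) / (2.0 * lep_pt2)]


# Closure check: build a lepton-neutrino pair, compute its invariant mass,
# and confirm that the true neutrino p_z is one of the two solutions.
lepton = (30.0, 10.0, 20.0)
neutrino = (-15.0, 25.0, 40.0)
e_lep = math.sqrt(sum(c * c for c in lepton))
e_nu = math.sqrt(sum(c * c for c in neutrino))
dot = sum(a * b for a, b in zip(lepton, neutrino))
pair_mass = math.sqrt(2.0 * (e_lep * e_nu - dot))

solutions = neutrino_pz(*lepton, neutrino[0], neutrino[1], m_w=pair_mass)
print(solutions)
```

In the closure check the second solution is a different longitudinal momentum
that satisfies the same mass constraint, which is why both solutions are carried
through to the top-pair invariant mass.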

7.2.2  Adding  in  the  Momentum  and  Energy  of  the  Jets 

So far, we have not used jets in our event trigger. However, we do use jets to
reconstruct the top-pair invariant mass. In addition, we use the jet topology to
help further distinguish the signal from the backgrounds. Jets are defined using a
simple algorithm. One first considers the “hot” cells (those with transverse energy
greater than 5 GeV). Cells are combined to form a jet if they lie within a specified
“radius” $R_j^2 = \Delta\eta^2 + \Delta\phi^2$ in $\eta$-$\phi$ space from each other. Jets have an energy $E_j$ given
by the sum of the energy of each cell in the cluster and a momentum $\vec{p}_j$ given
by the vector sum of the momenta of each cell. The invariant mass of a jet is
simply $M_j^2 = E_j^2 - \vec{p}_j \cdot \vec{p}_j$. In this analysis, we take the jet radius to be $R_j = 0.4$
and require jets to have at least 15 GeV of transverse energy. Namely,

$R_j = 0.4$ and $E_T$(jet) > 15 GeV. (7.10)
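
One possible reading of this jet definition is sketched in Python below: hot
cells are taken in order of decreasing transverse energy, the hottest unused
cell seeds a jet, and every unused hot cell within the jet radius of the seed is
merged into it. This seeded procedure, the treatment of each cell as a massless
object, and the use of the scalar sum of cell transverse energies as the jet
transverse energy are all assumptions; the algorithm actually used in this work
may differ in these details.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Distance in eta-phi space, with the azimuthal difference wrapped to [0, pi]."""
    dphi = abs(phi1 - phi2) % (2.0 * math.pi)
    if dphi > math.pi:
        dphi = 2.0 * math.pi - dphi
    return math.hypot(eta1 - eta2, dphi)


def cluster_jets(cells, hot_et=5.0, radius=0.4, min_jet_et=15.0):
    """Cluster hot cells into jets.

    cells: list of (E_T, eta, phi) with phi in radians and E_T in GeV.
    Returns a list of dicts with the jet energy, momentum, E_T, and mass.
    """
    hot = sorted((c for c in cells if c[0] > hot_et), reverse=True)
    used = [False] * len(hot)
    jets = []
    for i, (_, seed_eta, seed_phi) in enumerate(hot):
        if used[i]:
            continue
        e = px = py = pz = et_sum = 0.0
        for k in range(i, len(hot)):
            et, eta, phi = hot[k]
            if used[k] or delta_r(seed_eta, seed_phi, eta, phi) > radius:
                continue
            used[k] = True
            e += et * math.cosh(eta)    # massless cell: E = E_T cosh(eta)
            px += et * math.cos(phi)
            py += et * math.sin(phi)
            pz += et * math.sinh(eta)
            et_sum += et
        if et_sum >= min_jet_et:
            mass2 = e * e - (px * px + py * py + pz * pz)
            jets.append({"E": e, "px": px, "py": py, "pz": pz,
                         "ET": et_sum, "mass": math.sqrt(max(mass2, 0.0))})
    return jets


example_cells = [(20.0, 0.0, 0.0), (10.0, 0.1, 0.1), (8.0, 2.0, 3.0), (3.0, 0.0, 0.05)]
jets = cluster_jets(example_cells)
print(len(jets), jets[0]["ET"], round(jets[0]["mass"], 2))
```

In the example the two nearby hot cells form one jet, the isolated 8 GeV cell
falls below the jet transverse energy requirement, and the 3 GeV cell is never
considered because it is not hot.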

The top-pair invariant mass, $M_{t\bar{t}}$, is constructed from the energy and momentum
of the charged lepton, the energy and momentum of the reconstructed neutrino,
and the overall momentum vector of the associated jets as follows:


$M_{t\bar{t}}^2 = (E_\ell + E_\nu + E_{jets})^2 - (\vec{p}_\ell + \vec{p}_\nu + \vec{p}_{jets})^2$ (7.11)




where

$\vec{p}_{jets} = \sum_i^{jets} \vec{p}_i$, (7.12)

and

$E_{jets} = \sum_i^{jets} E_i$. (7.13)


The overall jet energy, $E_{jets}$, and momentum, $\vec{p}_{jets}$, are constructed by summing
over all jets with transverse energy greater than 15 GeV. We do not require the
event to have a minimum of four jets. The calorimeter cell cuts have replaced the
need to make a jet multiplicity cut. This can be seen in Figure 7.7 which shows the
multiplicity of jets with $E_T$(jet) > 15 GeV for the top-pair signal and the W+jets
background after the lepton plus missing $E_T$ trigger and the calorimeter cell cuts.
The cell cuts have forced the signal and background jet multiplicities to look similar
and one does not gain much by making an additional jet multiplicity cut. (At this
stage, requiring $N_{jet} \ge 4$ would result in an additional relative enhancement of
about 2 with an efficiency of 81%.)
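
The top-pair invariant mass is the invariant mass of the summed four-vectors of
the charged lepton, the reconstructed neutrino, and the selected jets. The
Python sketch below implements that sum. The four-vector layout (E, px, py, pz),
the function names, and the clipping of a slightly negative squared mass to
zero are illustrative assumptions, and the caller is assumed to pass only jets
that already satisfy the transverse energy requirement.

```python
import math


def invariant_mass(four_vectors):
    """Invariant mass of a set of four-vectors given as (E, px, py, pz) in GeV."""
    e = px = py = pz = 0.0
    for ve, vx, vy, vz in four_vectors:
        e += ve
        px += vx
        py += vy
        pz += vz
    mass2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(mass2, 0.0))  # guard against rounding below zero


def top_pair_mass(lepton, neutrino, jets):
    """Top-pair invariant mass from the lepton, one neutrino solution, and the jets."""
    return invariant_mass([lepton, neutrino, *jets])


# Toy check: two back-to-back massless objects of 100 GeV each give 200 GeV.
toy_mass = invariant_mass([(100.0, 0.0, 0.0, 100.0), (100.0, 0.0, 0.0, -100.0)])

# Toy event with a lepton, a neutrino, and two massless jets (made-up numbers).
event_mass = top_pair_mass((50.0, 30.0, 40.0, 0.0),
                           (50.0, -30.0, -40.0, 0.0),
                           [(120.0, 0.0, 0.0, 120.0), (120.0, 0.0, 0.0, -120.0)])
print(toy_mass, event_mass)
```

Since both neutrino solutions are kept in this analysis, the function would be
evaluated once per solution for each event.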


7.2.3  Comparing  With  the  Parton-Parton  Center  of  Mass  Energy 

The top-pair invariant mass, $M_{t\bar{t}}$, corresponds to the center-of-mass energy,
$E_{cm}$, of the underlying parton-parton two-to-two subprocess which has a threshold
at twice the mass of the top quark, $E_{cm} > 2M_{top}$. This is seen clearly in Figure 7.8
which compares the true $q\bar{q} \to t\bar{t}$ and $gg \to t\bar{t}$ CM energy, $E_{cm}$ (not experimentally
observed), with the reconstructed top-pair invariant mass, $M_{t\bar{t}}$. If the neutrino
momentum could be precisely determined from the missing $E_T$ and if we knew
exactly which particles to include in the jets then the two curves in Figure 7.8




[Figure 7.8 plot. Title: Top-Pair Invariant Mass; 175 GeV top in 1.8 TeV $p\bar{p}$ collisions after “zero-level” lepton, missing $E_T$ trigger, and cell cuts. Legend: Reconstructed Invariant Mass. Horizontal axis: Mass or Energy (GeV).]


Figure 7.8: Shows the reconstructed top-pair invariant mass, $M_{t\bar{t}}$, for 175 GeV
top quarks produced in 1.8 TeV $p\bar{p}$ collisions (solid curve). The plot contains only
the top-pair signal and corresponds to the number of events per year (with Lum =
100/pb) in a 50 GeV bin. The events have survived the “zero-level” lepton plus missing
$E_T$ trigger and the calorimeter cell cuts. Also shown is the true parton-parton CM
energy of the event (not directly observable experimentally).


would agree. Although one cannot precisely reconstruct the parton-parton CM
energy, there still remains a nice peak in the reconstructed top-pair invariant mass
at twice the top mass. We can use the observation of this peak as a measure of
how well one can determine the top quark mass and we would like to remove as
much background as possible from the peak.

Figure 7.8 includes only the top-pair signal with no background. Figure 7.9
shows the reconstructed parton-parton CM energy for the top-pair signal and the
W+jets background after the “zero-level” lepton plus missing $E_T$ trigger and the
calorimeter cell cuts. The plot shows the sum of the signal and the background.




Figure  7.9:  Shows  the  reconstructed  top-pair  invariant  mass,  Mu,  for  175  GeV  top 
quarks  produced  in  1.8  TeV  pp  collisions  together  with  the  W -I-  jets  background. 
The  plot  shows  the  sum  of  the  signal  plus  background  and  corresponds  to  the 
number  of  events  per  year  (with  Lum=  100/pb)  in  a 50  GeV.  The  events  have 
survived  the  “zero-level”  lepton  plus  missing  Et  trigger  and  the  calorimeter  cell 
cuts. 

At  this  stage  the  signal  is  about  twice  the  background.  However,  the  signal  to 
background  ratio  can  be  improved  by  examining  in  more  detail  the  “shape”  of  the 
events. 


7.3  Modified  Fox-Wolfram  Moments

In the previous chapter, we used 11 manually selected variables after the
“traditional” cuts that, in our opinion, were “sufficiently good” variables to describe
the difference between signal and background events. The selection of these variables
itself may have introduced a bias which would have manifested itself as a reduced
signal enhancement compared to the “ideal” case where no information is lost due
to variable selection.

Table 7.2: Shows the mean value and standard deviation from the mean of the
first six modified Fox-Wolfram moments applied to the jets in the event with
transverse energy greater than 15 GeV. Results are shown for the top-pair signal
and the W+jets background. Also shown are the resulting Fisher coefficients.

            Signal: Top (175 GeV)     Background: W+jets      Fisher
  Moment      mean      stdev           mean      stdev       Coefficient
  H1          0.24      0.18            0.36      0.25        -0.500
  H2          0.28      0.16            0.44      0.22        -1.282
  H3          0.28      0.14            0.40      0.19        -1.088
  H4          0.28      0.13            0.41      0.18        -0.978
  H5          0.29      0.12            0.40      0.17        -0.544
  H6          0.29      0.12            0.40      0.17        -0.069

The modified Fox-Wolfram moments introduced in Chapter 3 provide a mechanism
to reduce, or even eliminate, the need for input variable selection. These
moments are based on the observation that the overall topology of an event is
cylindrically symmetric and that the topology of signal events is (in general) different
from background events. For example, Table 7.2 shows the statistical difference
between signal and background events for the first six modified Fox-Wolfram
coefficients, H_l. In all six cases, the mean values are clearly different for signal
and background events (the signal means are lower by 0.11 to 0.16), which indicates
that these moments have the potential to discriminate between signal and background.

From a mathematical point of view, if we were to use all modified Fox-Wolfram
coefficients as a representation of an event, we would not have lost any information
at all. On the other hand, using the first six coefficients is reasonable since higher
moments describe minor variations in the overall “shape” of the event that are not
likely to be significant for the discrimination between signal and background.

In hadron-hadron collisions spherical symmetry is lost and we are interested
more in the shape of events in the transverse plane. For example, the Fox-Wolfram
moments [4, 5, 6] when applied directly to hadron-hadron collisions would interpret
a minimum bias event as a “two-jet” event. Instead, we would like to have a
minimum bias event treated more like a “spherically symmetric” e+e- final state (i.e.,
no structure). To accomplish this, we define the modified Fox-Wolfram moments
for hadron-hadron collisions as


    H_\ell = \frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{+\ell}
             \left| \sum_{i}^{\rm jets} Y_\ell^m(\Omega_i)\,
             \frac{E_T^i}{E_T({\rm sum})} \right|^2                    (7.14)


where the inner sum is now over all the jets in the event with transverse energy,
E_T^i, greater than 15 GeV and Ω_i = (θ_i, φ_i) is the angular location of the jet. Here,
E_T(sum) is the sum of the transverse energy of all the jets that are included in
the sum. These modified moments also lie in the range 0 ≤ H_l ≤ 1 and by
definition H_0 = 1. Furthermore, if the transverse momentum of the jets in the
event is conserved then H_1 = 0. In this case, however, events that are completely
cylindrically symmetric about the beam axis give H_l ≈ 0 for all l.
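As an illustration of Eq. (7.14), the following minimal sketch evaluates the
moments for a list of jets. It is not the code used in this analysis. It assumes
that each jet is given as a tuple (Et, eta, phi), and it uses the addition theorem
for spherical harmonics to rewrite the sum over m as a double sum over jets,
H_l = sum_ij w_i w_j P_l(cos gamma_ij) with w_i = E_T^i / E_T(sum) and gamma_ij
the opening angle between jets i and j. The function names and the toy input
are hypothetical.

```python
import math

def legendre(l, x):
    """Legendre polynomial P_l(x) from the standard three-term recursion."""
    if l == 0:
        return 1.0
    p_prev, p = 1.0, x
    for n in range(1, l):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p

def modified_fox_wolfram(jets, l_max=6, et_min=15.0):
    """Return [H_0, ..., H_l_max] for jets given as (Et, eta, phi) tuples."""
    kept = [j for j in jets if j[0] > et_min]
    et_sum = sum(et for et, _, _ in kept)
    if et_sum <= 0.0:
        raise ValueError("no jets above the Et threshold")
    # Weight Et/Et(sum) and unit direction vector of each jet,
    # using cos(theta) = tanh(eta) and sin(theta) = 1/cosh(eta).
    dirs = []
    for et, eta, phi in kept:
        sin_t, cos_t = 1.0 / math.cosh(eta), math.tanh(eta)
        dirs.append((et / et_sum,
                     sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t))
    moments = []
    for l in range(l_max + 1):
        h = 0.0
        for wi, xi, yi, zi in dirs:
            for wj, xj, yj, zj in dirs:
                # Clamp against rounding just outside [-1, 1].
                cos_ij = max(-1.0, min(1.0, xi * xj + yi * yj + zi * zj))
                h += wi * wj * legendre(l, cos_ij)
        moments.append(h)
    return moments

# Toy input (not from the simulation): two equal-Et jets, back to back
# in the transverse plane.
toy_jets = [(50.0, 0.0, 0.0), (50.0, 0.0, math.pi)]
print([round(h, 3) for h in modified_fox_wolfram(toy_jets)])
```

For this back-to-back toy configuration the pair sum reduces to
H_l = [1 + P_l(-1)]/2, so the even moments equal one and the odd moments vanish,
which is the “two-jet” limit of the definition.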

Table 7.2 shows the mean values and standard deviations for six of the modified
Fox-Wolfram moments calculated using all jets with Et > 15 GeV for events
that have survived the “zero-level” lepton and missing Et trigger and the
calorimeter cell cuts. There are clearly still some differences between the jet
topologies of the top-pair signal and the W+jets background. The mean values of the
six moments H1, ..., H6 are smaller for the signal than for the background, indicating
that the jets originating from the top-pair signal form a more cylindrically symmetric
pattern when they emerge from the event than do the background jets. The
top-pair jets are more spread out in η-φ space. This can be seen in Figures 7.10, 7.11
and 7.12, which show the H1, H2, and H4 distributions, respectively, for the signal
and background. At this stage one could simply make a linear cut on, for example,
H2. Requiring H2 < 0.3 gives an additional signal to background enhancement
of about 2 with a relative efficiency of around 60%. One can do a little better,
however, by using the information of all six of the H_l's simultaneously. This can
be done by constructing a neural network, by using Fisher discriminants, by using
multi-dimensional linear cuts, etc. We will apply the first two methods to this
analysis.

Figure 7.10: Shows the modified Fox-Wolfram moment, H1, calculated using the
jets in the event with transverse energy greater than 15 GeV for top-pair signal
and for the W+jets background. The plot shows the percentage of events in a
0.05 bin. The events have survived the “zero-level” lepton plus missing Et trigger
and the calorimeter cell cuts. (If the vector sum of the momentum of all the jets
in the event is zero then H1 = 0.)

Figure 7.11: Shows the modified Fox-Wolfram moment, H2, calculated using all
the jets in the event with transverse energy greater than 15 GeV for top-pair signal
and for the W+jets background. The plot shows the percentage of events in a
0.05 bin. The events have survived the “zero-level” lepton plus missing Et trigger
and the calorimeter cell cuts.

Although in this chapter we only use jets in conjunction with the modified
Fox-Wolfram moments, it should be stressed that the same analysis could have
been performed by using the calorimeter cells directly. The latter approach has the
advantage of not having to define jets at all while at the same time not complicating
the analysis since the modified Fox-Wolfram moments are equally well defined in
terms of the calorimeter cells. In the next chapter we will use the modified
Fox-Wolfram moments with both jets and calorimeter cells. The results will be slightly
better in the latter case, which should not be surprising since any definition of a
jet is necessarily arbitrary, regardless of how reasonable it may appear to be.

Figure 7.12: Shows the modified Fox-Wolfram moment, H4, calculated using all
the jets in the event with transverse energy greater than 15 GeV for top-pair signal
and for the W+jets background. The plot shows the percentage of events in a
0.05 bin. The events have survived the “zero-level” lepton plus missing Et trigger
and the calorimeter cell cuts.


7.4  Neural  Network  Analysis 

As usual, the first step in applying neural networks is to decide on the input
variables that enter the network. These variables must characterize the differences
between the signal and the background and, ideally, they would constitute the
“perfect” choice. As already discussed, the modified Fox-Wolfram coefficients have
(at least in principle) the ability to describe an event globally without the need
for a complex variable selection procedure that can diminish the final results if it
is not “perfect” (which is the typical case). In this analysis, we chose the first six
modified Fox-Wolfram coefficients to serve as input for the neural network^. I.e.,
the input variables, x_i, are

    x1 = H1,    x2 = H2,
    x3 = H3,    x4 = H4,
    x5 = H5,    x6 = H6.


The network is trained on a sample of 4,000 top-pair signal and 3,814 W+jets
background events using the six inputs shown above. Both signal and
background events have already satisfied the lepton and missing Et trigger and
the calorimeter cell cuts. To get this training sample, it was necessary to generate
50,000 top-pair events and 1,200,000 W+jets events.

As discussed in Chapter 4, there is no systematic procedure (other than genetic
algorithms) that provides the best network topology for a given problem. Since we

^The reason for choosing the first six coefficients, vs. the first five or seven coefficients, is not
justified in this analysis. The general justification is that only the first N moments are important
since higher moments describe too much detail in the events which is not likely to matter in the
final results. Strictly speaking, a consistent investigation of the number of moments that should
be used would have been a better approach. However, this number is not universal and would
have to be investigated for each particular application of the modified Fox-Wolfram coefficients.
A slightly generalized version of the genetic algorithms described in Chapter 5 would be a perfect
candidate for a consistent investigation of the number of moments needed in each particular case.
For example, the simplest approach would be to let this integer number be a gene and to determine
its value dynamically. If the dynamically determined value is the same as the maximum number
of coefficients given to the individuals, then this would indicate that the maximum number is not
sufficient or, alternatively, that some information was lost as a consequence. On the other hand,
if the individuals in the population settled on a value that is not the maximum value, one would
be justified to “believe” that there was no information loss due to the selection of the maximum
number of modified Fox-Wolfram coefficients given to the individuals in the population (which
could be neural networks, linear cuts, Fisher discriminants, etc.).


[Figure 7.13 plot: “Network Response”, network 6-12-1; 175 GeV top quark in 1.8 TeV
p p̄ collisions after the “zero-level” lepton and missing Et trigger and cell cuts;
horizontal axis: Network Output (0.025 to 0.925); vertical axis: 0% to 20%;
legend: Top Signal, W+Jets Background]

Figure 7.13: Shows the network response, Znet, for the sample of signal and
background events used in the training. The plot corresponds to the percentage of
events with Znet within a 0.05 bin for the top-pair signal and the W+jets
background. The events have survived the “zero-level” lepton plus missing Et trigger
and the calorimeter cell cuts.

are not using genetic algorithms in this analysis, we are interested in the simplest
network that can discriminate signal from background. Since the modified
Fox-Wolfram coefficients (as will become clear later) are an excellent representation for
an event, we use a very simple network with only one hidden layer. Its structure is
6-12-1^, which implies 97 thresholds and synaptic weights (6×12 + 12×1 = 84 weights
plus 12 + 1 = 13 thresholds). Figure 7.13 shows the
network response (i.e., Znet) for the sample of signal and background events used
in the training. There are some events around Znet = 0.5 for which the net cannot
distinguish between signal and background. Nevertheless, the net does allow for

^Once  again,  the  first  number  is  the  number  of  inputs.  The  second  number  indicates  the 
number  of  neurons  in  the  first  hidden  layer,  etc.  The  last  number  specifies  the  number  of  neurons 
in  the  output  layer  which,  in  all  of  our  applications,  is  set  to  1. 


[Figure 7.14 plot: “Network Performance”; horizontal axis: Efficiency]


Figure 7.14: Shows the enhancement versus the efficiency for the training sample
of events for a 6-12-1 neural network with 97 memory parameters. Each point in
the plot corresponds to a different choice for the network cut-off with the lower
efficiencies and higher enhancements corresponding to larger values of Zcut. The
network enhancements are compared with the enhancements obtained by the use of
Fisher discriminants.

some  separation  of  signal  and  background.  The  net  clearly  recognizes  some  events 
as  signal  or  background,  while  for  other  events  there  is  an  overlap  and  the  net 
cannot  distinguish  between  the  two. 
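The parameter count of the 6-12-1 network can be checked with a short sketch.
The code below is only an illustration under stated assumptions: it assumes a
fully connected layered feedforward network with one threshold per non-input
neuron and a sigmoidal activation, and the sign convention for the thresholds,
the function names, and the all-zero placeholder parameters are hypothetical
rather than taken from the trained network.

```python
import math

def count_parameters(layers):
    """Synaptic weights plus thresholds of a fully connected layered net."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))
    thresholds = sum(layers[1:])  # one threshold per non-input neuron
    return weights + thresholds

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, weights, thresholds):
    """Propagate the inputs x through the net.

    weights[k][j][i] links neuron i of layer k to neuron j of layer k+1;
    subtracting the threshold is an assumed convention.
    """
    activations = list(x)
    for w_layer, t_layer in zip(weights, thresholds):
        activations = [
            sigmoid(sum(w * a for w, a in zip(w_row, activations)) - t)
            for w_row, t in zip(w_layer, t_layer)
        ]
    return activations

layers = [6, 12, 1]
n_params = count_parameters(layers)  # 6*12 + 12*1 weights, 12 + 1 thresholds

# An untrained net with every parameter set to zero (placeholder values only).
weights = [[[0.0] * n_in for _ in range(n_out)]
           for n_in, n_out in zip(layers, layers[1:])]
thresholds = [[0.0] * n_out for n_out in layers[1:]]
z_net = forward([0.24, 0.28, 0.28, 0.28, 0.29, 0.29], weights, thresholds)[0]
print(n_params, z_net)
```

With all parameters at zero the output sits at the midpoint of the sigmoid for
any input, which corresponds to the undecided region around Znet = 0.5; training
moves signal and background events away from this value.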

The next step is to perform a network cut-off and assign any event with Znet >
Zcut to be signal and events with Znet < Zcut to be background. The enhancement
and efficiency of the network depend on the value chosen for Zcut, where the network
enhancement and efficiency are defined as

    F_{\rm enh}^{\rm net} =
        \frac{\%\ \text{of signal with}\ Z_{\rm net} > Z_{\rm cut}}
             {\%\ \text{of background with}\ Z_{\rm net} > Z_{\rm cut}},

    F_{\rm eff} = \%\ \text{of signal with}\ Z_{\rm net} > Z_{\rm cut}.


The overall network performance can be characterized by the single curve of
the network enhancement versus the network efficiency shown in Figure 7.14. Each
point in Figure 7.14 corresponds to a different choice for the network cut-off with
the lower efficiencies and higher enhancements corresponding to larger values of
Zcut. For example, a net cut of Zcut = 0.75 corresponds to an additional
enhancement of about 4 with a relative efficiency of about 47%.
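The enhancement and efficiency defined above can be computed directly from
the lists of network outputs. The following sketch is illustrative only: the
function names are hypothetical, the efficiency is returned as a fraction
rather than a percentage, and the toy outputs are made-up numbers, not values
from the training sample.

```python
def enhancement_and_efficiency(signal_z, background_z, z_cut):
    """Return (enhancement, efficiency) for the cut Z > z_cut."""
    sig_frac = sum(z > z_cut for z in signal_z) / len(signal_z)
    bkg_frac = sum(z > z_cut for z in background_z) / len(background_z)
    # No surviving background means the ratio is unbounded.
    enhancement = sig_frac / bkg_frac if bkg_frac > 0 else float("inf")
    return enhancement, sig_frac

def performance_curve(signal_z, background_z, cuts):
    """One (z_cut, enhancement, efficiency) point per cut value."""
    return [(z_cut,) + enhancement_and_efficiency(signal_z, background_z, z_cut)
            for z_cut in cuts]

# Toy network outputs (illustrative only).
toy_signal = [0.95, 0.85, 0.80, 0.60, 0.55, 0.45, 0.30, 0.90]
toy_background = [0.10, 0.20, 0.35, 0.40, 0.55, 0.65, 0.80, 0.15]

for z_cut, enh, eff in performance_curve(toy_signal, toy_background,
                                         [0.25, 0.50, 0.75]):
    print(f"Zcut = {z_cut:.2f}  enhancement = {enh:.2f}  efficiency = {eff:.2f}")
```

Scanning the cut in this way traces out one performance curve of the kind
shown in Figure 7.14; the same function applies unchanged to the shifted Fisher
response.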

7.5  Fisher  Discriminant  Analysis

As in the previous chapter, we apply Fisher discriminants to the same input
data sample to determine whether neural networks provide additional signal
enhancement. The input to Fisher discriminants is still the set of the first six modified
Fox-Wolfram coefficients.

In this case, for each set of input variables (i.e., for each event) there is one
Fisher output F. We have
determined the Fisher coefficients for the sample of signal and background events
used to train the neural network and the Fisher response, F, for these events is
shown in Figure 7.15. In plotting the Fisher response in Figure 7.15, we have
shifted the values of F to lie between zero and one as follows:

    F \rightarrow \frac{F - F_{\min}}{F_{\max} - F_{\min}}.            (7.15)


Figure 7.15: Shows the “shifted” Fisher response, F, for the sample of signal and
background events used in the training of the neural network. The plot corresponds
to the percentage of events with F within a 0.05 bin for the top-pair signal and
the W+jets background. The events have survived the “zero-level” lepton plus
missing Et trigger and the calorimeter cell cuts. The position of our “Fisher cut”
is marked by the dotted line.

In this analysis, all the inputs, x_i, lie between zero and one and all the Fisher
coefficients, a_i, are negative, which implies that

    F_{\max} = 0 \quad \text{and} \quad F_{\min} = \sum_i a_i.          (7.16)
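The shift of the Fisher response can be illustrated with the coefficients listed
in Table 7.2. The sketch below assumes that the Fisher output is the plain linear
combination F = sum_i a_i x_i with no constant term (consistent with F_max = 0
for inputs in [0, 1] and negative coefficients); the function names are
hypothetical, and evaluating the response at the tabulated mean values is only
an illustration, not a result of the analysis.

```python
# Fisher coefficients a_1..a_6 as listed in Table 7.2.
FISHER_COEFFICIENTS = [-0.500, -1.282, -1.088, -0.978, -0.544, -0.069]

def fisher_response(moments, coefficients=FISHER_COEFFICIENTS):
    """Linear Fisher output F = sum_i a_i * x_i (no constant term assumed)."""
    if len(moments) != len(coefficients):
        raise ValueError("need one input per Fisher coefficient")
    return sum(a * x for a, x in zip(coefficients, moments))

def shifted_fisher_response(moments, coefficients=FISHER_COEFFICIENTS):
    """Map F onto [0, 1]; valid when all a_i < 0 and all x_i lie in [0, 1]."""
    f_max = 0.0                # reached when every input is 0
    f_min = sum(coefficients)  # reached when every input is 1
    return (fisher_response(moments, coefficients) - f_min) / (f_max - f_min)

# Mean values of H1..H6 from Table 7.2, used here only as sample inputs.
signal_means = [0.24, 0.28, 0.28, 0.28, 0.29, 0.29]
background_means = [0.36, 0.44, 0.40, 0.41, 0.40, 0.40]

print(shifted_fisher_response(signal_means))
print(shifted_fisher_response(background_means))
```

Because all coefficients are negative, events with small moments (signal-like)
are mapped towards one and events with large moments (background-like) towards
zero, so a cut that keeps large shifted F favours the signal.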

The separation between signal and background in Figure 7.15 is about the
same as for the network. As with the network, the overall Fisher performance can
be characterized by the single curve of the Fisher enhancement versus the Fisher
efficiency, which is shown in Figure 7.14 together with the network performance.
Each point corresponds to a different choice for the Fisher cut-off.

Figure 7.16: Shows the reconstructed top-pair invariant mass, Mtt, for 175 GeV top
quarks produced in 1.8 TeV p p̄ collisions together with the W+jets background.
The plot shows the sum of the signal plus background and corresponds to the
number of events per year (with Lum = 100/pb) in a 50 GeV bin. The events have
survived the “zero-level” lepton plus missing Et trigger, the calorimeter cell cuts,
and the Fisher cut-off.

Figure 7.14 shows that Fisher discriminants have essentially the same
performance curve as the neural network. This implies that the modified Fox-Wolfram
moments are a very good representation of the event signature. An alternative,
but not mutually exclusive, explanation would be that within the constraints of
layered feedforward neural networks using sigmoidal activation functions, the mean
square quadratic error performance measure and the backpropagation algorithm,
the difference in signal enhancement with respect to the assumptions inherent in
Fisher discriminants is undetectable.

The latter is a very important implication since it illustrates a non-trivial
application of neural networks. In the previous chapter, Fisher discriminants were
inferior to neural networks and, therefore, one should use the neural network
results. In this case, however, the performance of neural networks is the same as
that of Fisher discriminants. It stands to reason that

1.  Either Fisher discriminants (when applied to the modified Fox-Wolfram
coefficients) are sufficient in this case.

2.  Or, alternatively, we did not find a better neural network solution because
of the limitations imposed by the network structure as well as the training
algorithm.

Since we can never be sure that we have the best possible neural network
solution, the only practical alternative is to declare the modified Fox-Wolfram
moments an outstanding representation for a collider event.

We  complete  our  analysis  by  making  a cut  on  F as  follows: 

F > 0.75.  (7.17) 

As  can  be  seen  in  Table  7.1,  this  Fisher  cut  provides  an  additional  enhancement 
of  around  4 with  a relative  efficiency  of  about  44%  resulting  in  an  overall  signal  to 
background  ratio  of  about  9. 

Figure 7.16 shows the reconstructed parton-parton center of mass energy for
the top-pair signal and the W+jets background after the “zero-level” lepton plus
missing Et trigger, the calorimeter cell cuts, and the Fisher cut. The plot shows
the sum of the signal and the background. As can be seen from this figure, the
signal is about an order of magnitude larger than the background.

7.6  Summary 

The procedure in this chapter uses the event topology to select the top-quark
signal for the four-jet decay mode over the W+jets and b b̄+jets background in
hadron-hadron collisions. Our technique can be summarized as follows:

1.  Use the lepton and missing transverse energy trigger.

2.  Perform calorimeter cell cuts.

3.  Define jets and use the first six modified Fox-Wolfram moments.

4.  Apply neural networks to the modified Fox-Wolfram moments.

5.  Apply Fisher discriminants to the modified Fox-Wolfram moments.

The neural network results are similar to those of Fisher discriminants, which
indicates that the first six modified Fox-Wolfram moments are an excellent
representation of the event signature. The overall signal-to-background enhancement
is around 370 with an efficiency of 30%. The signal to background ratio is about 9.

We did not use b-tagging, which consists of the identification of a b-quark jet vs.
any other jet. Applying b-tagging to the current analysis would further improve the
separation of signal from background even if it is done as a last step of processing.


CHAPTER  8 

THE  SIX-JET  DECAY  MODE  OF  TOP 


The top quark decays into a b-quark and a W boson (t → bW). The W boson
decays into a lepton (e or μ) and a neutrino about 22% (2/9) of the time and
into a quark-antiquark pair about 67% (6/9) of the time. This implies that when
top-pairs are produced in hadron-hadron collisions, p p̄ → t t̄ + X, both of the W
bosons decay into a lepton and neutrino only about 5% of the time, resulting in
the final state consisting of two leptons, two neutrinos, and two b-quarks (lν lν b b̄).
This distinctive final state constitutes the “discovery” mode of the top quark at
hadron colliders [30, 33]. On the other hand, it is considerably more likely for one
of the W bosons to decay into a quark-antiquark pair, resulting in a final state
consisting of a lepton, a neutrino, a b b̄, and a q q̄ pair. The lν b b̄ q q̄ mode occurs
about 35% of the time or about 7 times more often than the purely leptonic mode.
The backgrounds are larger for this decay mode, but so is the signal. When each
of the four outgoing quarks produces a distinct jet, the resulting event contains a
lepton, a neutrino, and four jets (lν jjjj). This decay mode is used to analyze the
properties of the top quark in more detail and to determine, for example, the top
mass [32, 31]. The purely hadronic decay mode shown in Fig. 8.1 occurs about
60% of the time, and produces the “six-jet” topology shown in Fig. 8.2. The
six-jet decay mode of top-pair production is buried underneath “ordinary” QCD
multi-jet production such as that illustrated in Fig. 8.3.


Figure 8.1: Illustration of top-pair production in proton-antiproton collisions in
which both of the W bosons decay hadronically, resulting in a final state consisting
of a b b̄ pair and two q q̄ pairs.


[Figure 8.2 diagram; labels: antiquark, anti-b, quark, proton, antiproton]


Figure  8.2:  Shows  the  event  topology  for  the  top-pair  signal.  If  each  of  the  outgoing 
partons  produces  a distinct  jet,  then  the  final  state  contains  six  jets. 




Figure  8.3:  Illustration  of  the  QCD  multi-jet  background  to  the  top-pair  production 
in  proton-antiproton  collisions  shown  in  Fig.  8.1. 

In this chapter, we concentrate on isolating the t t̄ six-jet mode from the
background using only the event topology [36]. The signal in Fig. 8.1 contains b quarks
whereas the QCD multi-jet background in Fig. 8.3, in general, does not.
Therefore, b-quark tagging will improve the signal to background ratio. However, we
would like to investigate how well one can do using only the event topology.

The event topology is represented by the modified Fox-Wolfram moments
defined in Chapter 3. We use multi-dimensional linear cuts on the modified
Fox-Wolfram coefficients to directly maximize the statistical significance measure, Rf,
introduced in Chapter 3. This maximization is performed by using the genetic
algorithms described in Chapter 5.




8.1  Event  Simulation  and  Detection 


ISAJET version 7.06 is used to generate top quarks with a mass of 175 GeV
in 1.8 TeV proton-antiproton collisions. At this energy, 175 GeV top-pairs are
produced via quark-antiquark annihilation, q q̄ → t t̄, about 88% of the time and by
gluon-gluon fusion, gg → t t̄, the remaining 12%. We refer to this as the “signal”.
We have normalized the top cross section to be 7.5 pb, corresponding to 750 events
with an integrated luminosity of 100/pb [30, 33, 36]. The “background” consists of
ordinary QCD multi-jet events generated using ISAJET with the hard-scattering
transverse momentum, k_T, greater than 20 GeV.

We do not attempt to do a detailed simulation of the CDF or D0 detector [30,
33]. Events are analyzed by dividing the solid angle into “calorimeter” cells having
size Δη Δφ = 0.1 × 7.5°, where η and φ are the pseudorapidity and azimuthal angle,
respectively. Our simple calorimeter covers the range |η| < 4 and has 3840 cells
(80 in η times 48 in φ).
A single cell has an energy (the sum of the energies of all the particles that hit the
cell excluding neutrinos) and a direction given by the coordinates of the center of
the cell. The transverse energy, Et, of each cell is computed from the cell energy
and direction. We have taken the energy resolution to be perfect, which means
that the only resolution effects are caused by the lack of spatial resolution due to
the cell size.
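The simple calorimeter described above can be sketched as follows. This is an
illustrative reconstruction, not the program used for the analysis: the function
names and the (energy, eta, phi) particle format are hypothetical, and the cell
transverse energy is taken to be E sin θ = E / cosh η evaluated at the cell
center, which is one natural reading of “computed from the cell energy and
direction”.

```python
import math

ETA_MAX = 4.0
N_ETA = 80                      # 8.0 units of pseudorapidity / 0.1 per cell
N_PHI = 48                      # 360 degrees / 7.5 degrees per cell
D_ETA = 2.0 * ETA_MAX / N_ETA
D_PHI = 2.0 * math.pi / N_PHI

def cell_index(eta, phi):
    """Map a direction to (i_eta, i_phi), or None outside |eta| < 4."""
    if abs(eta) >= ETA_MAX:
        return None
    i_eta = min(int((eta + ETA_MAX) / D_ETA), N_ETA - 1)
    i_phi = int((phi % (2.0 * math.pi)) / D_PHI) % N_PHI
    return i_eta, i_phi

def cell_center(i_eta, i_phi):
    """Pseudorapidity and azimuth of the center of a cell."""
    return (-ETA_MAX + (i_eta + 0.5) * D_ETA, (i_phi + 0.5) * D_PHI)

def fill_calorimeter(particles):
    """Sum particle energies per cell and convert to transverse energy.

    particles: iterable of (energy, eta, phi) with neutrinos already removed.
    Returns a dict {(i_eta, i_phi): Et}.
    """
    energy = {}
    for e, eta, phi in particles:
        cell = cell_index(eta, phi)
        if cell is not None:
            energy[cell] = energy.get(cell, 0.0) + e
    return {cell: e / math.cosh(cell_center(*cell)[0])
            for cell, e in energy.items()}

# Toy particles (illustrative only); the last one falls outside the acceptance.
cells = fill_calorimeter([(10.0, 0.02, 0.01), (5.0, 0.03, 0.02), (7.0, 5.0, 0.0)])
print(N_ETA * N_PHI, cells)
```

Since the energy resolution is taken to be perfect, the only smearing in such a
model comes from replacing each particle direction by the center of its cell.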


8.2  Modified  Fox-Wolfram  Moments

The modified Fox-Wolfram coefficients are used to describe the topology of each
event. As in the previous chapter, this will help us to not make arbitrary decisions
about the input variables. In addition, however, we will apply the modified
Fox-Wolfram moments directly to the calorimeter cells, thereby avoiding the need to
define jets. In conjunction with the fact that we do not have any cuts on the
data before computing the modified Fox-Wolfram coefficients, the analysis in this
chapter turns into a powerful tool for jet physics analysis since it depends neither
on a set of cuts that may not be the “best” ones nor on the definition of
jets (when using the calorimeter cells). We still use jets in order to compare the
results with direct (calorimeter cells) computations.


8.2.1  Constructing  Fox-Wolfram  Moments  from  Calorimeter  Cells

When we are using the calorimeter cells directly, the modified Fox-Wolfram
moments for hadron-hadron collisions are given by

    H_\ell({\rm cell}) = \frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{+\ell}
             \left| \sum_{i}^{\rm cells} Y_\ell^m(\Omega_i)\,
             \frac{E_T^i}{E_T({\rm sum})} \right|^2

where the inner sum is over all the calorimeter cells in the event with transverse
energy, E_T^i, greater than some minimum (for example, 5 GeV) and Ω_i = (θ_i, φ_i) are
the angular locations of the center of the cell. In this case, E_T(sum) is the total
transverse energy of all the cells that are included in the sum. The calorimeter
cells contain all the information concerning the topology of the event and it is not
necessary to define jets. These modified moments also lie in the range 0 ≤ H_l ≤ 1
and by definition H_0 = 1.

Table 8.1: Shows the mean value and standard deviation from the mean (mean±σ)
of six of the modified Fox-Wolfram moments, constructed from the calorimeter
cells (with Et(cell) > 5 GeV) and from jets with Rj = 0.4 and Et(jet) > 15 GeV.
Results are shown for the top-pair signal and the QCD multi-jet background in
1.8 TeV proton-antiproton collisions.

            H_l(cell)                         H_l(jet), Rj = 0.4
  Moment    Signal          Background        Signal          Background
  H1        0.2053±0.0797   0.3160±0.1698     0.1970±0.0784   0.3104±0.1748
  H2        0.2827±0.1093   0.5479±0.3581     0.2711±0.1046   0.5557±0.3737
  H3        0.2670±0.0951   0.3849±0.1883     0.2593±0.0934   0.3890±0.1985
  H4        0.2738±0.0959   0.4774±0.2670     0.2713±0.0976   0.4937±0.2894
  H5        0.2688±0.0908   0.4058±0.1946     0.2723±0.0964   0.4223±0.2150
  H6        0.2640±0.0867   0.4463±0.2296     0.2744±0.0965   0.4738±0.2612

Table 8.1 shows the mean values and standard deviations for six of the modified
Fox-Wolfram moments calculated using all cells with Et(cell) > 5 GeV for the
top-pair signal and the QCD multi-jet background. The mean values of the six
moments H1, ..., H6 are considerably smaller for the signal than the background.
For our calorimeter (3840 cells with Δη Δφ = 0.1 × 7.5°) equal transverse energy
in every cell yields H_l = 0 for odd l and H2 = 0.39, H4 = 0.23, H6 = 0.15.
This corresponds to a cylindrically symmetric “blob”. The signal lies closer to
this “blob” configuration in H_l-space than do most of the background events.
The background contains many two, three, and four jet configurations in addition
to some higher jet multiplicity configurations. The top-pair transverse energy
deposition is usually more spread out in η-φ space than the background. This can
be seen in Figs. 8.4 and 8.5, which show the H2 and H4 distributions, respectively,
for the signal and background. In a given event, all six moments are, on the average,
small for the top-pair signal, whereas for the background usually at least one of
the moments is large. This can be seen in Fig. 8.6, which shows the distribution of
the maximum of the six moments, H1, ..., H6, in each event for the top-pair signal
and the QCD multi-jet background.
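The “blob” values quoted above can be cross-checked with a few lines of code.
The sketch below is an illustration under stated assumptions rather than the
original calculation: for equal transverse energy in every cell of a grid that is
uniform in φ, only the m = 0 harmonic survives, so the moment reduces to
H_l = (average of P_l(cos θ) over the η rings)^2 with cos θ = tanh η at the cell
centers. The function names are hypothetical.

```python
import math

def legendre(l, x):
    """Legendre polynomial P_l(x) from the standard three-term recursion."""
    if l == 0:
        return 1.0
    p_prev, p = 1.0, x
    for n in range(1, l):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p

def uniform_blob_moments(eta_max=4.0, n_eta=80, l_max=6):
    """H_0..H_l_max for equal Et in every cell of a grid uniform in eta, phi.

    Azimuthal symmetry removes all m != 0 terms, leaving
    H_l = (mean over eta rings of P_l(tanh(eta)))**2.
    """
    d_eta = 2.0 * eta_max / n_eta
    centers = [-eta_max + (k + 0.5) * d_eta for k in range(n_eta)]
    moments = []
    for l in range(l_max + 1):
        mean_pl = sum(legendre(l, math.tanh(eta)) for eta in centers) / n_eta
        moments.append(mean_pl ** 2)
    return moments

blob = uniform_blob_moments()
print([round(h, 2) for h in blob])
```

Under these assumptions the odd moments vanish by the symmetry η → -η, and the
even moments come out close to the values 0.39, 0.23, and 0.15 quoted in the
text for H2, H4, and H6.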


[Figure 8.4 plot: “Modified H2 Applied to Cells”; 175 GeV top quark, 1.8 TeV
proton-antiproton collisions, Et(cell) > 5 GeV, before H_l cuts; horizontal axis:
H2 (0.025 to 0.925); vertical axis: percentage of events; legend: Top Signal,
QCD Jets Background]

Figure 8.4: Shows the modified Fox-Wolfram moment, H2, calculated directly from
the calorimeter cells with Et(cell) > 5 GeV for top-pair signal and for the QCD
multi-jet background. The plot shows the percentage of events in a 0.05 bin with
the sum of all bins normalized to 100%.


8.2.2  Constructing  Modified  Fox-Wolfram  Moments  from  Jets

Instead  of  using  the  calorimeter  cells  directly  to  characterize  the  event  topology, 
one  can  define  “jets”  and  use  them  to  construct  modified  Fox- Wolfram  moments. 
We  define  jets  using  a simple  algorithm.  One  first  considers  the  “hot”  cells  (those 
with  transverse  energy  greater  than  5GeV).  Cells  are  combined  to  form  a jet  if 
they  lie  within  a specified  “radius”  Rj  = -|-  in  rj-4)  space  from  each  other. 

Jets  have  an  energy  given  by  the  sum  of  the  energy  of  each  cell  in  the  cluster  and 
a momentum  pj  given  by  the  vector  sum  of  the  momentums  of  each  cell.  The 
invariant  mass  of  a jet  is  simply  Mj  = E]  - pj  • pj.  In  this  analysis. 


we  examine 


252 


> 

ui 

•8 

I 

i 

Q. 


12% 


10% 


8% 


6% 


4% 


2% 


16% 

14% 


B*for«  HL  cut* 


ET(c*N)  > 6 GtV 


0% 

0.025 


0.125 


0.225 


0.325 


0425 


0525 

H4 


0 625 


0.725 


0 825 


0925 


18% 


176  GtVTopQuvfc 
1 .8  T«V  Proton-AntIproton  CoNMont 


Modified  H4  Applied  to  Cells 


■ Top  Signal  □ QCD  Jata  Background 


Figure  8.5:  Shows  the  modified  Fox- Wolfram  moment,  calculated  directly  from 

the  calorimeter  cells  with  Exicell)  > 5 GeV  for  top-pair  signal  and  for  the  QCD 
multi-jet  background.  The  plot  shows  the  percentage  of  events  in  a 0.05  bin  with 
the  sum  of  all  bins  normalized  to  100%. 


Maximum  Modified  HL  Applied  to  Cells 


ET(caH)  > 5 GaV  ITS  GaV  Top  Quark 

1.8  TaV  Proton-AntIproton  CoNlaiona 


16%  j 

14%  ■■ 

12%  ■■ 

m 

1 

10%  ■■ 

> 

UJ 

8% 

8 

a 

6% 

8 

Q. 

4%  ■■ 

2% 

0%  -■ 

0025 


Bafora  HL  cuta 


0.125  0.225  0.325  0 425  0 525  0 625  0 725  0 825  0 925 

Maximum  HL 


■ Top  Signal  B QCD  Jata  Background 


Figure  8.6:  Shows  the  largest  of  the  first  six  modified  Fox- Wolfram  moments,  Hi, 
for  t = 1, ...  ,6  in  each  event  calculated  directly  from  the  calorimeter  cells  with 
ET{cell)  > 5 GeV  for  top-pair  signal  and  for  the  QCD  multi-jet  background.  The 
plot  shows  the  percentage  of  events  in  a 0.05  bin  with  the  sum  of  all  bins  normalized 
to  100%. 


253 


both  “narrow”,  Rj  — 0.4,  and  “fat”,  Rj  — 0.7,  jets,  where  jets  are  required  to  have 
at  least  15  GeV  of  transverse  energy. 
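The jet definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis: the text does not say how the "within a radius of each other" rule is applied, so the sketch uses single-linkage clustering (two hot cells end up in the same jet if a chain of neighbours closer than R_j connects them), treats every cell as massless, and takes the jet transverse energy to be the scalar sum of the cell transverse energies. The names `cluster_jets`, `delta_r`, and `cell_four_vector` are hypothetical.

```python
import math


def delta_r(a, b):
    """Distance in eta-phi space, with the phi difference wrapped into [-pi, pi)."""
    d_eta = a["eta"] - b["eta"]
    d_phi = (a["phi"] - b["phi"] + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(d_eta, d_phi)


def cell_four_vector(cell):
    """(E, px, py, pz) of a massless cell given its E_T, eta and phi."""
    et, eta, phi = cell["et"], cell["eta"], cell["phi"]
    return (et * math.cosh(eta), et * math.cos(phi),
            et * math.sin(phi), et * math.sinh(eta))


def cluster_jets(cells, radius, hot_et=5.0, min_jet_et=15.0):
    """Single-linkage clustering of hot cells (an assumed reading of the text)."""
    hot = [c for c in cells if c["et"] > hot_et]
    parent = list(range(len(hot)))          # union-find forest over hot cells

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(hot)):
        for j in range(i + 1, len(hot)):
            if delta_r(hot[i], hot[j]) < radius:
                parent[find(i)] = find(j)

    groups = {}
    for i, cell in enumerate(hot):
        groups.setdefault(find(i), []).append(cell)

    jets = []
    for members in groups.values():
        e, px, py, pz = (sum(v) for v in zip(*(cell_four_vector(c) for c in members)))
        jet_et = sum(c["et"] for c in members)   # assumption: scalar sum of cell E_T
        if jet_et >= min_jet_et:
            m2 = e * e - (px * px + py * py + pz * pz)   # M_j^2 = E_j^2 - p_j . p_j
            jets.append({"E": e, "p": (px, py, pz), "et": jet_et,
                         "mass": math.sqrt(max(m2, 0.0)),
                         "n_cells": len(members)})
    return sorted(jets, key=lambda jet: jet["et"], reverse=True)
```

Calling `cluster_jets(cells, radius=0.4)` and `cluster_jets(cells, radius=0.7)` on the same list of cells would give the "narrow" and "fat" jets, respectively.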

The modified Fox-Wolfram moments are constructed from jets as follows:

\[
H_\ell(\mathrm{jets}) = \frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{+\ell}
  \left| \sum_{i}^{\mathrm{jets}} Y_\ell^m(\Omega_i)\,
  \frac{E_T^i}{E_T(\mathrm{sum})} \right|^2 ,
\]

where the inner sum is now over all the jets in the event with transverse energy
greater than some minimum (which we take to be 15 GeV) and Ω_i = (θ_i, φ_i)
are the angular locations of the jets. Here, E_T(sum) is the sum of the transverse
energies of all the jets that are included in the sum.
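As a rough numerical illustration of these moments, the sketch below evaluates H_ℓ for a list of jets (or cells). It assumes the moments carry the usual Fox-Wolfram normalization 4π/(2ℓ+1) in front of the sum over m; under that assumption the spherical-harmonic addition theorem turns the sum over m into a Legendre polynomial of the opening angle, so that H_ℓ = Σ_ij w_i w_j P_ℓ(cos γ_ij) with w_i = E_T^i / E_T(sum). The function names are hypothetical and the code is not taken from the analysis.

```python
import math


def legendre(ell, x):
    """Legendre polynomial P_ell(x) from the standard three-term recurrence."""
    if ell == 0:
        return 1.0
    p_prev, p_curr = 1.0, x
    for n in range(1, ell):
        p_prev, p_curr = p_curr, ((2 * n + 1) * x * p_curr - n * p_prev) / (n + 1)
    return p_curr


def unit_vector(theta, phi):
    """Cartesian unit vector for polar angle theta and azimuth phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def modified_fox_wolfram(objects, ell):
    """H_ell for objects given as (E_T, theta, phi) tuples.

    Uses the addition theorem, which is equivalent to the |sum_i w_i Y_lm|^2
    form only if the prefactor is 4*pi/(2*ell+1) (assumed here).
    """
    et_sum = sum(et for et, _, _ in objects)
    weights = [et / et_sum for et, _, _ in objects]
    directions = [unit_vector(theta, phi) for _, theta, phi in objects]
    total = 0.0
    for w_i, n_i in zip(weights, directions):
        for w_j, n_j in zip(weights, directions):
            cos_gamma = sum(a * b for a, b in zip(n_i, n_j))
            cos_gamma = max(-1.0, min(1.0, cos_gamma))   # guard against round-off
            total += w_i * w_j * legendre(ell, cos_gamma)
    return total
```

With this form the limits are easy to read off: a single jet gives H_ℓ = 1 for every ℓ, while two equal back-to-back jets give H_ℓ = 0 for odd ℓ and 1 for even ℓ, consistent with the moments ranging from zero to one.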

Table 8.1 shows the mean values and standard deviations for six of the modified
Fox-Wolfram moments calculated using all jets with R_j = 0.4 and E_T(jet) >
15 GeV for the top-pair signal and the QCD multi-jet background. The mean
values are similar to those constructed directly from the cells and, as before, the
mean values of the six moments H_1, ..., H_6 are considerably smaller for the signal
than for the background.

One can use the modified Fox-Wolfram moments constructed either from the
cells or from jets. In either case the H_ℓ's characterize the topology of the event.
At this point one could make a simple cut on H_ℓ(max) to enhance signal over
background (see Fig. 8.6), but one can do better by considering all six moments.
The six moments H_1, ..., H_6 form a six-dimensional space in which different regions
of the space correspond to different event topologies. They range from zero to one
and make excellent inputs into a neural network or Fisher discriminant [7]. In this
chapter, we restrict events to lie within a region of the six-dimensional H_ℓ-space.
The region is defined by L_ℓ < H_ℓ < R_ℓ for ℓ = 1, ..., 6. The left, L_ℓ, and
right, R_ℓ, cuts are selected using a genetic algorithm to maximize the signal
over the square root of the background.


8.3  Multi-Dimensional  Linear  Cuts  and  Genetic  Algorithms 


We are now interested in finding a set of left, L_ℓ, and right, R_ℓ, cuts (ℓ =
1, ..., 6) that maximizes the statistical significance measure, R_f, given by

\[
R_f = \frac{N_{sig}}{\sqrt{N_{bak}}}, \qquad
L_\ell < H_\ell < R_\ell, \quad \ell = 1, \ldots, 6,
\]

where N_sig is the number of surviving signal events and N_bak is the number of
surviving background events. The explicit form of R_f when applied to linear cuts
is given in Eq. 5.9.

Our data sample consists of the set of six modified Fox-Wolfram moments
H_1, ..., H_6 for 10,000 top-pair signal events and 10,000 QCD multi-jet background
events. The real-valued function, R_f, on the data is the number of signal events over
the square root of the number of background events that lie within the region of the
six-dimensional H_ℓ-space defined by the hypercube L_ℓ < H_ℓ < R_ℓ for ℓ = 1, ..., 6.
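The fitness evaluation described here amounts to counting the events that fall inside the hypercube. A minimal sketch follows, with hypothetical names (`passes`, `significance`); each event is assumed to be a sequence of its six moments, and the treatment of the case with no surviving background is an arbitrary choice made only to keep the function defined everywhere.

```python
import math


def passes(moments, left, right):
    """True if every moment lies strictly inside its (left, right) window."""
    return all(lo < h < hi for h, lo, hi in zip(moments, left, right))


def significance(signal, background, left, right):
    """Signal over the square root of background for events inside the cuts."""
    n_sig = sum(passes(event, left, right) for event in signal)
    n_bak = sum(passes(event, left, right) for event in background)
    if n_bak == 0:
        # Assumption: the text does not say how an empty background is handled.
        return float("inf") if n_sig > 0 else 0.0
    return n_sig / math.sqrt(n_bak)
```

For example, with `left = [0.0] * 6` and `right = [0.3] * 6`, an event survives only if all six of its moments are below 0.3.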

In biological terms, the set of signal and background events is the environment
in which a population resides, the real-valued function, R_f, is analogous to the
overall fitness of an individual for survival and reproduction, and the 12 parameters
L_ℓ and R_ℓ (ℓ = 1, ..., 6) are genes.

Each individual in the population is represented by the class LinearCuts introduced
in Section 5.5. In particular, we make use of the implicit gradient descent
algorithm also described there. The complete genetic map is shown in Fig. 8.7.
It consists of the mutation rate, R_m, the crossover rate, R_c, the parameter of the
implicit gradient descent algorithm, Δ, and 12 genes corresponding to the left,
L_ℓ, and right, R_ℓ, cuts on each modified Fox-Wolfram moment. All genes are
represented by 2 bytes and are restricted to the interval [0, 1].

| R_m | R_c | Δ | L_1 | R_1 | ... | L_6 | R_6 |

Figure 8.7: Genetic map of LinearCuts individuals when applied to the first six
modified Fox-Wolfram moments. The first two genes, R_m and R_c, are the mutation
and crossover rates, respectively. The third gene, Δ, is a parameter of the implicit
gradient descent algorithm. The first three genes are not used in the performance
measure R_f. The next 12 genes are the left and right cut pairs for each input
variable x_i. Their order reflects the fact that each pair of left and right cuts for the
same variable forms an operon and, therefore, they should be next to each other.
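To make the 2-byte gene representation concrete, the sketch below packs a 15-gene individual into 30 bytes. The text only states that each gene occupies 2 bytes and lies in [0, 1]; the linear mapping onto the integers 0..65535, the big-endian byte order, and all names here (`encode_gene`, `GENE_NAMES`, and so on) are assumptions made for illustration and are not the LinearCuts implementation of Section 5.5.

```python
GENE_MAX = 0xFFFF  # largest unsigned integer that fits in 2 bytes

# Gene order following the genetic map: control genes first, then (L, R) pairs.
GENE_NAMES = ["R_m", "R_c", "Delta"] + [
    f"{side}_{ell}" for ell in range(1, 7) for side in ("L", "R")
]


def encode_gene(value):
    """Map a real value onto a 16-bit integer, clipping to [0, 1] first."""
    clipped = min(1.0, max(0.0, value))
    return round(clipped * GENE_MAX)


def decode_gene(raw):
    """Inverse of encode_gene, up to the 1/65535 quantization step."""
    return raw / GENE_MAX


def encode_genome(values):
    """Concatenate the 2-byte encodings of all genes (big-endian, assumed)."""
    return b"".join(encode_gene(v).to_bytes(2, "big") for v in values)


def decode_genome(blob):
    """Split a byte string into 2-byte genes and decode each one."""
    return [decode_gene(int.from_bytes(blob[i:i + 2], "big"))
            for i in range(0, len(blob), 2)]
```

With 16 bits per gene the cut positions are resolved to about 1.5e-5, far finer than the 0.05 bin width used in the moment distributions.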

As explained in Chapter 5, finding an “optimal” solution or solutions is achieved
through genetic evolution of a population over many generations. We use a population
with a maximum size of 1,000 individuals. Their average life span is several
simulation years, and we evolve them until the mutation rate, the crossover rate,
and the Δ parameter all become zero for most of the individuals. This typically
requires 50 to 100 generations of evolution.

During reproduction, the new individual is subject to crossover and mutation.
Since we are also using the implicit gradient descent algorithm, after crossover and
mutation we shift one of the left, L_ℓ, or right, R_ℓ, cuts by an amount Δ,

\[
L_\ell \to L_\ell \pm \Delta \quad \text{or} \quad R_\ell \to R_\ell \pm \Delta. \tag{8.4}
\]

As already discussed, this allows for faster refinement of the solution and can
further enhance the convergence properties of the genetic algorithm. At the same
time, it makes the genetic algorithm more local in nature. Because of the low
dimensionality of the parameter space, however, this is not a serious problem.
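The shift of Eq. 8.4 can be illustrated as follows. The details of the implicit gradient descent algorithm are given in Section 5.5 and are not reproduced here, so this sketch makes its own assumptions: the cut to be shifted and the sign of the shift are drawn at random, the result is clipped to [0, 1], and no attempt is made to keep a left cut below its right partner. The function name `shift_one_cut` is hypothetical.

```python
import random


def shift_one_cut(cuts, delta, rng=random):
    """Return a copy of `cuts` with one randomly chosen entry moved by +/- delta.

    `cuts` is the list [L_1, R_1, ..., L_6, R_6]; every entry stays in [0, 1].
    """
    shifted = list(cuts)
    index = rng.randrange(len(shifted))   # which of the 12 cuts to move (assumed random)
    sign = rng.choice((-1.0, 1.0))        # direction of the shift (assumed random)
    shifted[index] = min(1.0, max(0.0, shifted[index] + sign * delta))
    return shifted
```

Whether a shifted individual survives is then decided by its fitness R_f like any other offspring; once Δ has evolved to zero the step leaves the cuts unchanged.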

Table 8.2 shows the left, L_ℓ, and right, R_ℓ, cuts (ℓ = 1, ..., 6) on the H_ℓ's
determined from our genetic algorithm procedure to maximize signal over the square
root of the background. We consider three cases. In the first case the modified
Fox-Wolfram moments are constructed directly from the calorimeter cells, H_ℓ(cell),
with E_T(cell) > 5 GeV. The other two cases are for modified Fox-Wolfram moments
constructed from “narrow”, R_j = 0.4, jets and from “fat”, R_j = 0.7, jets,
H_ℓ(jet), with E_T(jet) > 15 GeV.


Table 8.2: Shows the H_ℓ cuts determined from the genetic algorithm maximizing
the signal over the square root of the background. The H_ℓ's are restricted to lie in
the region L_ℓ < H_ℓ < R_ℓ for ℓ = 1, ..., 6 and are constructed from the calorimeter
cells directly, H_ℓ(cell), or from “narrow”, R_j = 0.4, jets and from “fat”, R_j = 0.7,
jets (H_ℓ(jet)).

        H_ℓ(cell) Cuts           H_ℓ(jet) Cuts, R_j = 0.4   H_ℓ(jet) Cuts, R_j = 0.7
        Left (L)    Right (R)    Left (L)    Right (R)      Left (L)    Right (R)
H_1     0.000198    0.347951     0.000000    0.216602       0.007355    0.217731
H_2     0.011261    0.225223     0.000000    0.218647       0.000000    0.256138
H_3     0.013932    0.249973     0.000000    0.265553       0.009720    0.160235
H_4     0.000565    0.588556     0.043092    0.381796       0.021011    0.491890
H_5     0.051927    0.192233     0.000000    0.288945       0.018845    0.395163
H_6     0.026032    0.912840     0.081071    0.726467       0.026642    0.794415


8.4  Reconstructing  the  Top-Pair  Invariant  Mass 
The top-pair invariant mass, M_tt, corresponds to the center-of-mass energy,
E_cm, of the underlying parton-parton two-to-two subprocess, which has a threshold
at twice the mass of the top quark, E_cm > 2 M_top. Although one cannot precisely
reconstruct the parton-parton center-of-mass energy, the hope is that one will be
able to observe a peak in the reconstructed top-pair invariant mass at twice the top
quark mass. The size of this peak relative to the background determines whether
this mode can be seen. The top-pair invariant mass can be reconstructed from the
outgoing jets or directly from the calorimeter cells.

8.4.1  Using  the  Calorimeter  Cells  Directly 

The parton-parton invariant mass can be constructed directly from the calorimeter
cells as follows:

\[
M_{t\bar{t}}^2 = E_{\mathrm{cells}}^2 - \vec{p}_{\mathrm{cells}} \cdot \vec{p}_{\mathrm{cells}}, \tag{8.5}
\]

where

\[
\vec{p}_{\mathrm{cells}} = \sum_{i}^{\mathrm{cells}} \vec{p}_i, \tag{8.6}
\]

and

\[
E_{\mathrm{cells}} = \sum_{i}^{\mathrm{cells}} E_i. \tag{8.7}
\]

The overall cell energy, E_cells, and momentum, p_cells, are constructed by summing
over all cells with transverse energy greater than some minimum (which we take
to be 5 GeV).
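Equations 8.5 to 8.7 translate directly into a short routine. The sketch below is illustrative only: it assumes each cell is a massless object described by its energy and the polar and azimuthal angles of its center, with transverse energy E sin θ, and the name `invariant_mass` is hypothetical.

```python
import math


def invariant_mass(cells, min_et=5.0):
    """Invariant mass of all cells with E_T above `min_et`.

    `cells` is an iterable of (energy, theta, phi) tuples; each cell is
    treated as massless, so |p_i| = E_i (an assumption of this sketch).
    """
    e_sum = px = py = pz = 0.0
    for energy, theta, phi in cells:
        if energy * math.sin(theta) <= min_et:   # cell fails the E_T threshold
            continue
        e_sum += energy
        px += energy * math.sin(theta) * math.cos(phi)
        py += energy * math.sin(theta) * math.sin(phi)
        pz += energy * math.cos(theta)
    m2 = e_sum ** 2 - (px ** 2 + py ** 2 + pz ** 2)   # Eq. 8.5
    return math.sqrt(max(m2, 0.0))                   # guard against round-off
```

Because cells below the threshold are dropped from both sums, the reconstructed mass can only lose energy relative to the generated value. The same routine applies to jets if jet energies and directions are passed in and `min_et` is set to 15 GeV.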




8.4.2  Using  the  Outgoing  Jets 

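The top-pair invariant mass, M_tt, can be constructed from the energy and
momentum of the outgoing jets in the event as follows:

\[
M_{t\bar{t}}^2 = E_{\mathrm{jets}}^2 - \vec{p}_{\mathrm{jets}} \cdot \vec{p}_{\mathrm{jets}}, \tag{8.8}
\]

where

\[
\vec{p}_{\mathrm{jets}} = \sum_{i}^{\mathrm{jets}} \vec{p}_i, \tag{8.9}
\]

and

\[
E_{\mathrm{jets}} = \sum_{i}^{\mathrm{jets}} E_i. \tag{8.10}
\]

The overall jet energy, E_jets, and momentum, p_jets, are constructed by summing
over all jets with transverse energy greater than 15 GeV.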


Figure 8.8: Shows the multiplicity of “fat” jets (R_j = 0.7) with transverse energy
greater than 15 GeV for the top-pair signal and the QCD multi-jet background.
The plot shows the percentage of events with N jets with E_T(jet) > 15 GeV.


8.5  Isolating  Multi-Jet  Topologies 


8.5.1  Using  Jet  Multiplicity  Cuts 

Fig. 8.8 shows the multiplicity of jets, N_j, (with R_j = 0.7 and E_T > 15 GeV)
for the top-pair signal and the QCD multi-jet background. One obvious way to
enhance the top-pair signal over the background is to require the events to have a
minimum number of jets, N_j(min) (usually taken to be five). Table 8.3 shows that
after a jet multiplicity cut there are about 360 signal events and roughly 460,000
background events (in 100/pb) for the reconstructed mass range M_tt > 300 GeV.
The background is about 1,200 times larger than the signal.

Table 8.3: 175 GeV top quark pairs produced in 1.8 TeV proton-antiproton collisions.
The table shows the number of events (with ℒ = 100/pb) for the top-pair
signal and the QCD multi-jet background remaining after making a jet multiplicity
cut (N_j ≥ 5, R_j = 0.7, E_T(jet) > 15 GeV) and after making various H_ℓ cuts. The
H_ℓ's are constructed from the calorimeter cells directly, H_ℓ(cell), or from narrow,
R_j = 0.4, jets and from fat, R_j = 0.7, jets, H_ℓ(jet). The H_ℓ's are restricted to lie
in the region L_ℓ < H_ℓ < R_ℓ for ℓ = 1, ..., 6, where the left, L_ℓ, and right, R_ℓ,
cuts are selected using a genetic algorithm to maximize the signal over the square
root of the background and are given in Table 8.2. The top-pair invariant mass is
calculated either directly from the cells (cell mass) or from the jets (jet mass).

Selection                                          Mass Type   Mass Range   N_sig   N_bak     N_bak/N_sig   N_sig/sqrt(N_bak)
Jet cuts: N_j ≥ 5, R_j = 0.7, E_T(jet) > 15 GeV    Jet mass    > 300 GeV    364     444,551   1,221         0.55
H_ℓ(cell) cuts: E_T(cell) > 5 GeV                  Cell mass   > 250 GeV    54      4,621     85            0.80
H_ℓ(cell) cuts: R_j = 0.4, E_T(cell) > 5 GeV       Jet mass    > 300 GeV    65      8,138     125           0.72
H_ℓ(jet) cuts: R_j = 0.4, E_T(jet) > 15 GeV        Jet mass    > 300 GeV    105     17,578    168           0.79
H_ℓ(jet) cuts: R_j = 0.7, E_T(jet) > 15 GeV        Jet mass    > 300 GeV    87      31,843    365           0.49
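The two derived columns of Table 8.3 follow from the event counts by simple arithmetic, which the short script below repeats. The row labels are shorthand introduced here. Since the script starts from the rounded counts printed in the table, small differences in the last digit relative to the tabulated ratios are to be expected.

```python
import math

# (N_sig, N_bak) as printed in Table 8.3; labels are shorthand for this sketch.
rows = {
    "jet multiplicity cut, jet mass": (364, 444551),
    "H(cell) cuts, cell mass": (54, 4621),
    "H(cell) cuts, jet mass (R_j = 0.4)": (65, 8138),
    "H(jet) cuts, R_j = 0.4, jet mass": (105, 17578),
    "H(jet) cuts, R_j = 0.7, jet mass": (87, 31843),
}


def summarize(n_sig, n_bak):
    """Return (background over signal, signal over sqrt(background))."""
    return n_bak / n_sig, n_sig / math.sqrt(n_bak)


for label, (n_sig, n_bak) in rows.items():
    ratio, signif = summarize(n_sig, n_bak)
    print(f"{label:36s} N_bak/N_sig = {ratio:7.0f}   N_sig/sqrt(N_bak) = {signif:.2f}")
```

The comparison makes the trade-off visible: the jet multiplicity cut keeps the most signal but has the worst background-to-signal ratio, while the H_ℓ(cell) cuts keep far fewer events at a much better ratio.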




Fig. 8.9 shows the top-pair invariant mass reconstructed from the “fat” jets in
the event (with E_T(jet) > 15 GeV) for the top-pair signal (multiplied by 200) and
the QCD multi-jet background after a jet multiplicity cut (N_j ≥ 5). A problem
that arises when using a jet multiplicity cut is that the cut causes an artificial
peak in the background invariant mass near the peak in the signal: requiring
a minimum number of jets with transverse energy greater than 15 GeV removes
events with low parton-parton invariant mass. In addition, jet multiplicity cuts are
“quantized” (i.e., discrete). One cannot smoothly vary the degree of the cut to,
for example, optimize signal over background unless this cut is a gene as well and
is optimized simultaneously with the cuts on the modified Fox-Wolfram moments.


Figure 8.9: Shows the reconstructed top-pair invariant mass, M_tt, for 175 GeV top quarks
produced in 1.8 TeV proton-antiproton collisions together with the QCD multi-jet background
for events that have survived the jet multiplicity cut, N_j ≥ 5. The invariant mass is constructed
from all the jets (R_j = 0.7) in the event with E_T(jet) > 15 GeV. The plot shows the number of
events (with ℒ = 100/pb) in a 50 GeV bin. The top-pair signal has been multiplied by a factor
of 200.


8.5.2 Using H_ℓ Cuts Without Jets

We will now examine a method for isolating the top-pair signal over the background
without defining jets at all. The calorimeter cell information is used directly
to select the events and to reconstruct the top-pair invariant mass. The
six modified Fox-Wolfram moments H_1, ..., H_6 constructed from the calorimeter
cells, H_ℓ(cell), are used to select events. Events are required to lie in a region of
H_ℓ-space defined by L_ℓ < H_ℓ < R_ℓ for ℓ = 1, ..., 6. The left, L_ℓ, and right, R_ℓ, cuts
given in Table 8.2 were determined from our genetic algorithm procedure, which
maximizes the signal over the square root of the background. No jet multiplicity
cuts are made.

Fig. 8.10 shows the top-pair invariant mass reconstructed directly from the
calorimeter cells (with E_T(cell) > 5 GeV) for the top-pair signal (multiplied by
200) and the QCD multi-jet background after the H_ℓ cuts. Table 8.3 shows that
for the reconstructed mass range M_tt > 250 GeV there are about 50 signal events
and roughly 5,000 background events (in 100/pb). Here the background is about
a factor of 100 larger than the signal. For the top-pair signal, the invariant mass
reconstructed from cells with E_T(cell) > 5 GeV peaks at about 275 GeV, which
is less than the true top-pair mass of 350 GeV. Removing cells with transverse
energy less than 5 GeV reduces the reconstructed mass from its generated value.
Nevertheless, this method gives our best statistical significance of 0.8. One is looking
for a bump above a smoothly falling background, and it does not matter if the
mass is shifted downward. One can always correct the mass after one establishes
the signal.


[Figure 8.10 plot: "Cell Invariant Mass"; 175 GeV top quark, 1.8 TeV
proton-antiproton collisions; E_T(cell) > 5 GeV; after H_ℓ(cell) cuts. Horizontal
axis: invariant mass (GeV) in 50 GeV bins; vertical axis: number of events.
Series: top signal × 200 and QCD jets background.]

Figure 8.10: Shows the reconstructed top-pair invariant mass, M_tt, for 175 GeV top quarks
produced in 1.8 TeV proton-antiproton collisions together with the QCD multi-jet background
for events that have survived the H_ℓ(cell) cuts. Events are required to have H_ℓ's in the region
L_ℓ < H_ℓ < R_ℓ for ℓ = 1, ..., 6, where the left, L_ℓ, and right, R_ℓ, cuts are given in Table 8.2. The
H_ℓ(cell)'s and the invariant mass are constructed directly from the calorimeter cells using all
cells in the event with E_T(cell) > 5 GeV. Jets are never defined and no jet multiplicity cuts are
made. The plot shows the number of events (with ℒ = 100/pb) in a 50 GeV bin. The top-pair
signal has been multiplied by a factor of 200.


Furthermore,  the  use  of  this  method,  in  principle,  does  not  cause  an  artificial 
peak  in  the  reconstructed  invariant  mass  for  the  background.  There  is  a peak  in 
the  background  in  Fig.  8.10  but  it  is  at  much  lower  mass  than  the  signal  and  could 
be  eliminated  altogether  by  lowering  the  minimum  cell  transverse  energy  of  5 GeV. 

After the events have been selected using the H_ℓ(cell)'s, one can construct
and examine the jets in the event. Fig. 8.11 shows the multiplicity of “fat” jets
(R_j = 0.7) with transverse energy greater than 15 GeV for the top-pair signal and
the QCD multi-jet background for events that have survived the H_ℓ(cell) cuts. By
selecting events that lie in the region of H_ℓ-space given in Table 8.2, we have selected
events with a large number of jets, but in a smooth way. The background now peaks
at five jets instead of the two-jet peak in Fig. 8.8, and the signal and background jet
multiplicities now look similar. Fig. 8.12 shows the top-pair invariant mass, M_tt,
reconstructed from jets for the top-pair signal and the QCD multi-jet background
for events that have survived the H_ℓ(cell) cuts. The invariant mass is constructed
from all the “narrow” jets (R_j = 0.4) in the event with E_T(jet) > 15 GeV. No jet
multiplicity cuts are made. Here the invariant mass of the signal peaks at around
325 GeV, and Table 8.3 shows that the statistical significance is only slightly lower
than in the cell invariant mass case.


[Figure 8.11 plot: "Multiplicity of Jets"; 175 GeV top quark, 1.8 TeV
proton-antiproton collisions; R_j = 0.7, E_T(jet) > 15 GeV. Horizontal axis:
number of jets (1 to 10); vertical axis: percentage of events. Series: top signal
and QCD jets background.]

Figure 8.11: Shows the multiplicity of “fat” jets (R_j = 0.7) with transverse energy
greater than 15 GeV for the top-pair signal and the QCD multi-jet background for
events that have survived the H_ℓ(cell) cuts. Events are required to have H_ℓ(cell)'s
in the region L_ℓ < H_ℓ(cell) < R_ℓ for ℓ = 1, ..., 6, where the left, L_ℓ, and right, R_ℓ,
cuts are given in Table 8.2. The plot shows the percentage of events with N jets
with E_T(jet) > 15 GeV.


[Figure 8.12 plot: "Multi-Jet Invariant Mass"; 175 GeV top quark, 1.8 TeV
proton-antiproton collisions; R_j = 0.4, E_T(jet) > 15 GeV; after H_ℓ(cell) cuts.
Horizontal axis: invariant mass (GeV) in 50 GeV bins; vertical axis: number of
events. Series: top signal × 200 and QCD jets background.]

Figure 8.12: Shows the reconstructed top-pair invariant mass, M_tt, for 175 GeV top
quarks produced in 1.8 TeV proton-antiproton collisions together with the QCD
multi-jet background for events that have survived the H_ℓ(cell) cuts. Events are
required to have H_ℓ(cell)'s in the region L_ℓ < H_ℓ < R_ℓ for ℓ = 1, ..., 6, where the
left, L_ℓ, and right, R_ℓ, cuts are given in Table 8.2. The H_ℓ(cell)'s are constructed
from all the cells in the event with E_T(cell) > 5 GeV and the invariant mass is
constructed from all the jets (R_j = 0.4) in the event with E_T(jet) > 15 GeV.
No jet multiplicity cuts are made. The plot shows the number of events (with
ℒ = 100/pb) in a 50 GeV bin. The top-pair signal has been multiplied by a factor
of 200.


8.5.3 Using H_ℓ Cuts With Jets

Instead of working with the cells directly, one can define jets from the very
beginning and do the whole analysis with the jets. The six modified Fox-Wolfram
moments H_1, ..., H_6 constructed from the jets, H_ℓ(jet), are used to select
events. Events are required to lie in a region of H_ℓ-space defined by L_ℓ < H_ℓ < R_ℓ
for ℓ = 1, ..., 6. Table 8.2 gives the left, L_ℓ, and right, R_ℓ, cuts determined from
the genetic algorithm procedure to maximize the signal over the square root of the
background. The results for both “narrow” (R_j = 0.4) and “fat” (R_j = 0.7) jets
are given in Table 8.3. The “narrow” jets produce better results than the “fat” jets.

Fig. 8.13 shows the top-pair invariant mass reconstructed from the jets (with
E_T(jet) > 15 GeV and R_j = 0.4) for the top-pair signal (multiplied by 200) and
the QCD multi-jet background after the H_ℓ(jet) cuts. Table 8.3 shows that for the
reconstructed mass range M_tt > 300 GeV there are about 100 signal events and
roughly 17,000 background events (in 100/pb). The background is about a factor
of 170 larger than the signal, which is comparable to, but slightly worse than, what
we get from using the cells directly. This indicates that using the cells directly,
instead of defining jets, is a valid approach that even has the potential to give
superior results in certain applications.

[Figure 8.13 plot: "Multi-Jet Invariant Mass"; 175 GeV top quark, 1.8 TeV
proton-antiproton collisions; R_j = 0.4, E_T(jet) > 15 GeV; after H_ℓ(jet) cuts with
R_j = 0.4. Horizontal axis: invariant mass (GeV) in 50 GeV bins from 75 to 625;
vertical axis: number of events. Series: top signal × 200 and QCD jets background.]

Figure 8.13: Shows the reconstructed top-pair invariant mass, M_tt, for 175 GeV top
quarks produced in 1.8 TeV proton-antiproton collisions together with the QCD
multi-jet background for events that have survived the H_ℓ(jet) cuts. Events are
required to have H_ℓ(jet)'s in the region L_ℓ < H_ℓ(jet) < R_ℓ for ℓ = 1, ..., 6,
where the left, L_ℓ, and right, R_ℓ, cuts are given in Table 8.2. The H_ℓ(jet)'s and
the invariant mass are constructed from all the jets (R_j = 0.4) in the event with
E_T(jet) > 15 GeV, but no jet multiplicity cuts are made. The plot shows the
number of events (with ℒ = 100/pb) in a 50 GeV bin. The top-pair signal has
been multiplied by a factor of 200.


8.6 Summary

It is difficult to completely isolate the six-jet decay mode of top-pair production
over the QCD multi-jet background at hadron colliders without b-quark tagging.
We are able to reduce the background-to-signal ratio to less than
100 using purely topological methods and without the use of b-quark tagging.
b-quark tagging would, of course, further enhance the signal-to-background ratio.
The technique presented in this chapter can be summarized as follows:

• Construct six modified Fox-Wolfram moments, H_1, ..., H_6, directly from the
calorimeter cells or from jets.

• Select events that lie in a certain region of H_ℓ-space defined by L_ℓ < H_ℓ < R_ℓ
for ℓ = 1, ..., 6.

• Determine the left, L_ℓ, and right, R_ℓ, cuts using a genetic algorithm that
maximizes the signal over the square root of the background.

• Construct the top-pair invariant mass, M_tt, directly from the calorimeter
cells or from jets.

We do not make a jet multiplicity cut. Jet multiplicity cuts cause an artificial
peaking of the background invariant mass near the 2 M_top peak of the signal,
whereas requiring events to lie in a region of six-dimensional H_ℓ-space, in principle,
does not. Requiring the H_ℓ's to be small does select events with a large number
of jets, but in a smooth way. Also, the H_ℓ cuts can be varied continuously, whereas jet
multiplicity cuts are discrete. Furthermore, the modified Fox-Wolfram moments,
H_1, ..., H_6, can be constructed directly from the calorimeter cells without the need
to define jets.

We have used the six-jet decay mode of top-quark pair production at hadron
colliders as an example of our techniques. Other parton-parton subprocesses can
be isolated by selecting the regions of H_ℓ-space that correspond to their unique
topology. For example, many supersymmetric subprocesses have characteristic
event topologies where our techniques should also help to improve the signal-to-
background ratio.

Finally, the techniques presented here illustrate that genetic algorithms can
be successfully applied to high energy physics phenomenology. They are more
likely to find a better solution in a high-dimensional parameter space than a local
algorithm.


CHAPTER  9 

TAGGING  OF  TAU  JETS 


There are many high energy physics decay processes that lead to the production
of τ leptons. Unlike electrons and muons, however, τ leptons are heavy and
very unstable and quickly decay into other particles that generally do not leave
the calorimeter. As a consequence, τ leptons appear as jets, that is, as localized
deposits of energy in several adjacent calorimeter cells.

The goal of this chapter is to present a method which allows us to differentiate
between jets originating from taus and regular hadronic jets. This procedure is
called tagging. Such information is very useful since the presence of τ particles
implies leptonic decay of the process under consideration and, therefore, would
allow us to better separate the signal from the ordinary QCD background, which
is visible as hadronic jets.

Tagging of particles is usually just one component of the overall signal-to-background
enhancement procedure. For example, tagging could be used as a first step
of background elimination followed by other techniques such as neural networks
or just plain cuts on certain variables. Here we are only interested in the tagging
procedure and not in a specific application of it.

The results in this chapter are preliminary. Although the work is still in
progress, we will be able to illustrate the techniques used to tag tau jets. In
particular, we will utilize both multi-dimensional linear cuts and neural networks
trained by genetic algorithms. These techniques can be easily adapted to tagging
of other particles.


9.1  Event  Generation  and  Representation  of  Jets 

ISAJET version 7.06 is used to generate isolated quark and tau jets with transverse
energy E_T > 10 GeV. It should be stressed that a complete analysis would
also involve the inclusion of gluon jets as well as other particles. However, at this
stage we are only interested in whether our methods can differentiate
between pure quark jets, which we call the background, and pure tau jets,
which we call the signal.

Events are analyzed by dividing the solid angle into “calorimeter” cells having
size Δη Δφ = 0.1 × 7.5°, where η and φ are the pseudorapidity and azimuthal angle,
respectively. Our simple calorimeter covers the range |η| < 4 and has 3840 cells.
A single cell has an energy (the sum of the energies of all the particles that hit the
cell excluding neutrinos) and a direction given by the coordinates of the center of
the cell. The transverse energy, E_T, of each cell is computed from the cell energy
and direction. We have taken the energy resolution to be perfect, which means
that the only resolution effects are caused by the lack of spatial resolution due to
the cell size.
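The cell count follows from the geometry: 80 bins of width 0.1 in η times 48 bins of width 7.5° in φ gives 3840 cells. A minimal sketch of such a grid is given below; the indexing convention and all names (`cell_index`, `cell_center`, `transverse_energy`) are assumptions made for illustration. The transverse energy uses E_T = E sin θ = E / cosh η, with η taken at the cell center.

```python
import math

ETA_MAX = 4.0        # calorimeter covers |eta| < 4
D_ETA = 0.1          # cell size in pseudorapidity
D_PHI_DEG = 7.5      # cell size in azimuth (degrees)
N_ETA = round(2 * ETA_MAX / D_ETA)
N_PHI = round(360.0 / D_PHI_DEG)


def cell_index(eta, phi):
    """Return (i_eta, i_phi) for a particle, or None outside the acceptance.

    `phi` is in radians and may be any real number; it is wrapped to [0, 360).
    """
    if not -ETA_MAX <= eta < ETA_MAX:
        return None
    i_eta = min(int((eta + ETA_MAX) / D_ETA), N_ETA - 1)
    i_phi = int(math.degrees(phi) % 360.0 / D_PHI_DEG) % N_PHI
    return i_eta, i_phi


def cell_center(i_eta, i_phi):
    """Coordinates (eta, phi in radians) of the center of a cell."""
    eta = -ETA_MAX + (i_eta + 0.5) * D_ETA
    phi = math.radians((i_phi + 0.5) * D_PHI_DEG)
    return eta, phi


def transverse_energy(energy, eta):
    """E_T = E sin(theta) = E / cosh(eta) for a cell centered at eta."""
    return energy / math.cosh(eta)
```

Filling the calorimeter then consists of adding each non-neutrino particle's energy to the cell returned by `cell_index` and converting the cell totals with `transverse_energy` evaluated at `cell_center`.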

We define a jet as any 5 × 5 square of cells in the η-φ plane that satisfies the
condition

\[
E_T(\mathrm{jet}) = \sum_{i=1}^{25} (E_T)_i > 10\ \mathrm{GeV}, \tag{9.1}
\]

where (E_T)_i is the transverse energy of one of the 25 cells in the 5 × 5 square. We
also make sure not to double count high-energy jets by seeking the cell with the
highest transverse energy and putting it at the center of the square.

Since both quark jets and tau jets can have very different energies and since,
in this first step of the analysis, we are interested in finding a pattern that is
independent of the energy, we normalize the total energy of each jet to 1. In addition,
we let ISAJET emit the fraction of electromagnetic (EM) energy for each cell. Therefore,
the complete specification of a jet consists of the 25 cell transverse energies
(normalized so that they sum to 1) and the corresponding 25 fractions of EM energy.
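The jet definition of Eq. 9.1 and the normalization just described can be sketched as follows. The sketch handles only the single-jet case relevant for isolated jets: it centers the 5 × 5 window on the highest-E_T cell, wraps the window around in φ, treats cells beyond the η edge as empty, and divides by the window total. These choices, and the name `find_jet`, are assumptions for illustration.

```python
def find_jet(et_grid, min_jet_et=10.0):
    """Build the 5x5 jet window around the highest-E_T cell.

    `et_grid[i_eta][i_phi]` holds the cell transverse energies in GeV.
    Returns None if the window fails E_T(jet) > min_jet_et (Eq. 9.1).
    """
    n_eta, n_phi = len(et_grid), len(et_grid[0])
    seed_eta, seed_phi = max(
        ((i, j) for i in range(n_eta) for j in range(n_phi)),
        key=lambda ij: et_grid[ij[0]][ij[1]],
    )
    window = []
    for d_eta in range(-2, 3):
        i = seed_eta + d_eta
        row = []
        for d_phi in range(-2, 3):
            j = (seed_phi + d_phi) % n_phi            # phi is periodic
            row.append(et_grid[i][j] if 0 <= i < n_eta else 0.0)
        window.append(row)
    jet_et = sum(sum(row) for row in window)
    if jet_et <= min_jet_et:
        return None
    normalized = [[value / jet_et for value in row] for row in window]
    return {"center": (seed_eta, seed_phi), "et": jet_et, "cells": normalized}
```

The returned `cells` entry is the 5 × 5 array of normalized transverse energies; the 25 EM fractions would be carried along in a second array of the same shape.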

Due to the choice of the cell size in the η and φ directions, we have an approximate
cylindrical symmetry in the η-φ plane with respect to the central cell of the 5 × 5
square. To account for this symmetry, we introduce 6 layers as shown in Fig. 9.1.
The first layer is the central cell. Each subsequent layer consists of cells that are
equally distant from the central cell. In this representation, each jet is specified
by the six numbers corresponding to the total transverse energy in each layer and
the six fractions of EM energy in those layers.

    6  5  4  5  6
    5  3  2  3  5
    4  2  1  2  4
    5  3  2  3  5
    6  5  4  5  6

Figure 9.1: Shows the layers in a jet. The central cell has the highest transverse
energy and it is layer 1. Each other layer contains cells that are equal distance
away from the central cell.

Tables 9.1 and 9.2 illustrate the average normalized transverse energy and the
fraction of EM energy for tau and quark jets. These numbers imply that the tau
jets are narrower than the quark jets and that they have a larger proportion
of EM energy per cell. These two observations suggest that a separation between
these two types of jets should be possible even with linear cuts. It should also
be emphasized that, had the two distributions looked alike, one could not
have concluded that separation is impossible, because the domain of
applicability of linear statistics is much smaller than that of neural networks, for
example.

Table 9.1: Average distributions per cell of the transverse energy and
the fraction of EM energy for tau jets based on 855 events. Each block is the
5 x 5 grid of cells.

Average Normalized Transverse Energy Per Cell

    1.3e-05   0.00018   0.0016    0.00015   0
    0.0011    0.0098    0.028     0.01      0.0014
    0.004     0.051     0.79      0.04      0.0029
    0.0016    0.012     0.03      0.0099    0.0017
    0.00052   0.00049   0.0015    0.001     0.00034

Average Fraction of EM Energy Per Cell

    1.3e-05   2.4e-06   0.00066   0.00015   0
    0.00036   0.0042    0.013     0.0052    0.00089
    0.0007    0.023     0.28      0.022     0.0016
    0.00022   0.0051    0.014     0.0053    0.00039
    0.00043   5.5e-05   0.00089   0.00048   0.00018

Table 9.2: Average distributions per cell of the transverse energy and
the fraction of EM energy for quark jets based on 758 events. Each block is the
5 x 5 grid of cells.

Average Normalized Transverse Energy Per Cell

    0.0052    0.0077    0.0078    0.0085    0.0056
    0.0094    0.028     0.054     0.029     0.011
    0.015     0.061     0.52      0.061     0.015
    0.011     0.027     0.056     0.027     0.011
    0.0051    0.0084    0.011     0.0056    0.0045

Average Fraction of EM Energy Per Cell

    0.0015    0.0026    0.0023    0.0023    0.0016
    0.0031    0.0082    0.013     0.0084    0.0022
    0.0033    0.018     0.11      0.018     0.0036
    0.0032    0.0069    0.014     0.0068    0.0032
    0.0016    0.0024    0.0021    0.0014    0.0017

Tables 9.3 and 9.4 show the average normalized transverse energy and EM energy
fraction per layer for tau and quark jets, respectively. As a consequence of the
approximate cylindrical symmetry, these variables have the same implications:
the tau jets are narrower and have more EM energy. In order to reduce the
number of input variables and at the same time to account for the approximate
cylindrical symmetry, we will use the energies corresponding to the 6 layers instead
of the individual cell energies.

Table 9.3: Average distributions per layer of the transverse energy and
the fraction of EM energy for tau jets based on 855 events.

    Layer                                            1      2      3      4       5       6
    Average Normalized Transverse Energy Per Layer   0.79   0.15   0.042  0.01    0.0076  0.00086
    Average Fraction of EM Energy Per Layer          0.28   0.071  0.02   0.0038  0.0025  0.00063

Table 9.4: Average distributions per layer of the transverse energy and
the fraction of EM energy for quark jets based on 758 events.

    Layer                                            1      2      3      4       5       6
    Average Normalized Transverse Energy Per Layer   0.52   0.23   0.11   0.049   0.073   0.02
    Average Fraction of EM Energy Per Layer          0.11   0.063  0.03   0.011   0.02    0.0065
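The reduction from 25 cells to 6 layers can be written down compactly. The following is a minimal sketch, assuming that a layer is identified by the squared distance of a cell from the central cell (which reproduces the layout of Fig. 9.1) and that the per-layer value is the plain sum of the per-cell values. The function names are hypothetical. The example input is the per-cell transverse-energy block of Table 9.1.

```python
def layer_of(di, dj):
    """Layer index (1..6) of the cell at offset (di, dj) from the central cell."""
    # The squared distances that occur in a 5x5 window, in increasing order.
    order = {0: 1, 1: 2, 2: 3, 4: 4, 5: 5, 8: 6}
    return order[di * di + dj * dj]


def layer_sums(window):
    """Sum a 5x5 window of per-cell values into the six layers."""
    sums = [0.0] * 6
    for i in range(5):
        for j in range(5):
            sums[layer_of(i - 2, j - 2) - 1] += window[i][j]
    return sums


def normalize(window):
    """Scale a 5x5 window so that its entries sum to 1."""
    total = sum(sum(row) for row in window)
    return [[v / total for v in row] for row in window]


# Average normalized transverse energy per cell for tau jets (Table 9.1).
TAU_ET = [
    [1.3e-05, 0.00018, 0.0016, 0.00015, 0.0],
    [0.0011, 0.0098, 0.028, 0.01, 0.0014],
    [0.004, 0.051, 0.79, 0.04, 0.0029],
    [0.0016, 0.012, 0.03, 0.0099, 0.0017],
    [0.00052, 0.00049, 0.0015, 0.001, 0.00034],
]

for layer, value in enumerate(layer_sums(TAU_ET), start=1):
    print(layer, round(value, 5))
```

Because averaging and summing commute, the layer sums of the averaged cells should agree with the per-layer averages of Table 9.3 up to the rounding of the table entries.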

9.2  Tagging  with  Multi-Dimensional  Linear  Cuts 

We first attempt to differentiate between tau and quark jets by using
multi-dimensional linear cuts in conjunction with the statistical significance measure,
$R_f$, defined in Chapter 3. We will use four different sets of input variables derived
from the set of 12 numbers that specify each jet.

The first set of input variables consists of the 2 outermost major layers, 2' and
3', of transverse energy as shown in Fig. 9.2. In essence, we have combined layers
2 and 3 into the single major layer 2' and layers 4, 5 and 6 into the single major
layer 3'. The set of input variables, therefore, is the total normalized transverse
energy for layers 2' and 3', which we denote by $E_T^{2'}$ and $E_T^{3'}$.

    3'  3'  3'  3'  3'
    3'  2'  2'  2'  3'
    3'  2'  1'  2'  3'
    3'  2'  2'  2'  3'
    3'  3'  3'  3'  3'

Figure 9.2: Definition of major layers for a 5 x 5 set of calorimeter cells.
We have combined layers 2 and 3 into major layer 2' and layers 4, 5 and 6 into
major layer 3'. Layer 1 is the same as major layer 1'.
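As a rough illustration of how these two variables differ between the two samples, the per-layer averages quoted in Tables 9.3 and 9.4 can be combined into major-layer averages. This is only arithmetic on the rounded table entries, not an additional result:

```latex
\begin{align*}
\text{tau jets:}\quad   \langle E_T^{2'} \rangle &= 0.15 + 0.042 = 0.192, \\
                        \langle E_T^{3'} \rangle &= 0.01 + 0.0076 + 0.00086 \approx 0.018, \\
\text{quark jets:}\quad \langle E_T^{2'} \rangle &= 0.23 + 0.11 = 0.34, \\
                        \langle E_T^{3'} \rangle &= 0.049 + 0.073 + 0.02 = 0.142 .
\end{align*}
```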

The second set of variables consists of the normalized total transverse energy
in each of the six layers, which we denote by $E_T^1$, $E_T^2$, $E_T^3$, $E_T^4$, $E_T^5$ and $E_T^6$.

The third set of input variables extends the second set by adding the total
fraction of EM energy, $F_{EM}^{tot}$, which is defined as

$$F_{EM}^{tot} = \sum_{i=1}^{25} (F_{EM})_i \qquad (9.2)$$

where $(F_{EM})_i$ is the fraction of EM energy in cell $i$. Thus, the full set of input
variables for the third case is $E_T^1$, $E_T^2$, $E_T^3$, $E_T^4$, $E_T^5$, $E_T^6$ and $F_{EM}^{tot}$.

The fourth set of input variables consists of the normalized total transverse
energy in each of the six layers, $E_T^1$, $E_T^2$, $E_T^3$, $E_T^4$, $E_T^5$, $E_T^6$, and the corresponding
fraction of EM energy in each of those layers, which we denote by
$F_{EM}^1$, $F_{EM}^2$, ..., $F_{EM}^6$.

We use genetic algorithms to find the optimal set of left and right cuts for each
set of variables so that the performance measure $R_f$ is maximized. The
multi-dimensional cuts are represented by the LinearCuts species described in Section
5.5. The best results for the four different sets of input variables are shown in Table
9.5.

Table 9.5: Separation of tau and quark jets for four different sets of input variables
using multi-dimensional linear cuts trained by genetic algorithms. $S$ and $B$ are the
number of tau and quark jets in the data sample. $N_S$ and $N_B$ are the number of
surviving signal and background events. $F_{enh}$, $F_{eff}$ and $R_f$ are the enhancement,
the efficiency and the statistical significance measure.

    Input Variables                           S     B     N_S  N_B  F_enh  F_eff  R_f
    E_T^{2'}, E_T^{3'}                        1000  1000  515  84   6.1    0.52   3.16
    E_T^1, ..., E_T^6                         856   751   278  1    243.9  0.33   79.21
    E_T^1, ..., E_T^6, F_EM^tot               856   759   297  1    263.3  0.35   91.37
    E_T^1, ..., E_T^6, F_EM^1, ..., F_EM^6    856   759   299  3    88.4   0.35   30.87

It is clear that the two outer major layers do not provide enough information
for good separation of tau jets from quark jets. When we use all 6 layers, the
performance improves significantly. On the other hand, adding the total fraction of
EM energy of the jet, $F_{EM}^{tot}$, had little effect on the performance, which seems
surprising since tau jets have, at least on average, more EM energy. The fourth
result is even more surprising: the performance actually decreased even though
we used the EM energy for all six layers.
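The entries of Table 9.5 can be recomputed from the event counts. The sketch below assumes that $F_{enh} = (N_S/N_B)/(S/B)$, $F_{eff} = N_S/S$ and $R_f = F_{enh} F_{eff}$. These forms are inferred from the table entries, not quoted from the definition in Chapter 3, and all function names are hypothetical.

```python
def passes(event, cuts):
    """True if every variable lies inside its [left, right] interval."""
    return all(left <= x <= right for x, (left, right) in zip(event, cuts))


def count_survivors(signal, background, cuts):
    """Number of signal and background events that survive the cuts."""
    n_s = sum(1 for event in signal if passes(event, cuts))
    n_b = sum(1 for event in background if passes(event, cuts))
    return n_s, n_b


def measures(n_s, n_b, s, b):
    """Return (F_enh, F_eff, R) under the assumed definitions."""
    f_enh = (n_s / n_b) / (s / b)   # gain in signal-to-background ratio
    f_eff = n_s / s                 # fraction of the signal retained
    return f_enh, f_eff, f_enh * f_eff


# First and second rows of Table 9.5.
print(measures(515, 84, 1000, 1000))
print(measures(278, 1, 856, 751))
```

With these assumptions the measure equals $N_S^2 B / (N_B S^2)$, so a single surviving background event in the denominator already produces very large values.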

The performance degradation in the last case has two origins:

1. The intrinsic limitations of linear cuts as a means of signal enhancement.

2. The minimum mutation rate for the LinearCuts species is zero and, as a
consequence, a population can lose its genetic diversity.

In the first case, linear cuts seem to be inadequate as a model, which indicates
that the additional leptonic information is not neatly contained in a hyper-cube
for the tau jets and outside that cube for the quark jets. The second factor, the
LinearCuts species limitation, results in a more local algorithm, which makes it
more difficult to find better solutions in the long run. As already discussed in
Section 5.5, the LinearCuts species are designed to be useful for parameter spaces
of smaller dimension. By comparison, the dimension of the parameter space for
the last data set is 24 (a left and a right cut for each of the 12 input variables).
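The role of a non-zero minimum mutation rate can be illustrated with a generic mutation-only evolutionary search. This is not the LinearCuts or NeuralNetworks species of Chapter 5. The population size, the step-size update rule, the elitist selection and the name `evolve` are all assumptions made for the sake of a short, self-contained sketch.

```python
import random


def evolve(fitness, dim, pop_size=20, generations=200, min_rate=0.05, seed=0):
    """Mutation-only search that maximizes fitness; returns (best_genes, rate)."""
    rng = random.Random(seed)
    # Each individual carries its parameters and its own mutation step size.
    pop = [([rng.uniform(-1.0, 1.0) for _ in range(dim)], 0.5)
           for _ in range(pop_size)]
    pop.sort(key=lambda ind: fitness(ind[0]), reverse=True)
    for _ in range(generations):
        children = []
        for genes, rate in pop:
            # The step size is itself mutated, but never drops below the floor,
            # so the population cannot freeze into a single point.
            new_rate = max(min_rate, rate * rng.choice((0.8, 1.25)))
            child = [g + rng.gauss(0.0, new_rate) for g in genes]
            children.append((child, new_rate))
        # Elitist selection: keep the best individuals among parents and children.
        pop = sorted(pop + children, key=lambda ind: fitness(ind[0]),
                     reverse=True)[:pop_size]
    return pop[0]


# Usage: maximize a simple concave function of three parameters.
best, rate = evolve(lambda x: -sum(v * v for v in x), dim=3)
print(best, rate)
```

Setting `min_rate=0.0` in this sketch allows the step sizes to shrink without bound, which is the loss of diversity described above.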

Another feature of the results for the second and third sets of input variables
is that only 1 background event survived the cuts. This indicates that there is a
solution with 0 background events surviving. We do not allow such solutions since,
in general, there is always a small hyper-cube around a signal event that does not
contain any background events. Since the $R_f$ measure is infinite in such cases,
were we to accept solutions with 0 background events in general, we would always
find exactly those “spurious” solutions. It is a different matter if we
already know that there are entire regions containing a significant fraction of the
signal events and no background events at all; we might then change our measure
and look for those regions. In that case we would be maximizing the efficiency
subject to the constraint that no background events survive.
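The two objectives discussed above can be contrasted in code. The sketch assumes that the measure has the form $N_S^2 B/(N_B S^2)$, which is inferred from the table entries and not quoted from Chapter 3, and that rejected solutions are simply ranked below all others. Both function names are hypothetical.

```python
import math


def fitness(n_s, n_b, s, b):
    """Significance-style fitness that rejects cuts leaving no background."""
    if n_b == 0:
        # Assumption: spurious zero-background solutions rank below all others.
        return -math.inf
    return (n_s * n_s * b) / (n_b * s * s)


def constrained_fitness(n_s, n_b, s):
    """Alternative objective: efficiency, valid only if no background survives."""
    return n_s / s if n_b == 0 else -math.inf


print(fitness(297, 1, 856, 759))          # third row of Table 9.5
print(fitness(5, 0, 856, 759))            # rejected
print(constrained_fitness(300, 0, 856))   # efficiency under the constraint
```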

9.3  Tagging  with  Neural  Networks 

We now repeat our analysis by using neural networks trained by genetic
algorithms. We use the same four sets of input variables. The performance measure is
also the same.

The neural networks are represented by the NeuralNetworks species defined in
Section 5.6. We use 2 hidden layers of size 2N and N, where N is the number of
input variables for each data set. For example, in the first case we use 2-4-2-1
neural networks and in the fourth case we use 12-24-12-1 neural networks.
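The N-2N-N-1 architecture and the size of its parameter space can be sketched as follows. The tanh activation and the use of one bias per unit are assumptions, since the NeuralNetworks species is defined in Section 5.6 and not here. With these assumptions the 12-24-12-1 network has 625 parameters. The function names are hypothetical.

```python
import math


def layer_sizes(n_inputs):
    """Architecture N-2N-N-1 for N input variables."""
    return [n_inputs, 2 * n_inputs, n_inputs, 1]


def count_parameters(sizes):
    """Number of weights plus biases of a fully connected feed-forward network."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


def forward(x, weights, biases):
    """Evaluate the network; weights[k][j][i] connects unit i to unit j of layer k+1."""
    for w, b in zip(weights, biases):
        # Assumption: tanh activation on every layer, including the output.
        x = [math.tanh(sum(wji * xi for wji, xi in zip(row, x)) + bj)
             for row, bj in zip(w, b)]
    return x[0]


for n in (2, 6, 7, 12):
    print(n, layer_sizes(n), count_parameters(layer_sizes(n)))
```

Under these assumptions the count is $4N^2 + 4N + 1 = (2N+1)^2$, so the parameter space grows quadratically with the number of input variables.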

Since the NeuralNetworks species do not use any embedded local algorithms
and since their mutation rate can never go to zero, we perform a single run on
each set of data and let the neural networks evolve for as long as is feasible.
The results from these runs are presented in Table 9.6.

Table 9.6: Separation of tau and quark jets for four different sets of input variables
using neural networks trained by genetic algorithms. $S$ and $B$ are the number of
tau and quark jets in the data sample. $N_S$ and $N_B$ are the number of surviving signal
and background events. $F_{enh}$, $F_{eff}$ and $R_f$ are the enhancement, the efficiency and
the statistical significance measure.

    Input Variables                           S     B     N_S  N_B  F_enh  F_eff  R_f
    E_T^{2'}, E_T^{3'}                        1000  1000  521  91   5.7    0.52   2.98
    E_T^1, ..., E_T^6                         856   751   261  1    229.0  0.31   69.82
    E_T^1, ..., E_T^6, F_EM^tot               856   759   370  1    328.1  0.43   141.81
    E_T^1, ..., E_T^6, F_EM^1, ..., F_EM^6    856   759   526  1    466.4  0.61   286.59

As can be seen from the table, we have a steady improvement of the performance
with the number of input variables. This is distinctly different from the linear cuts
case, where the leptonic energy cases did not give us an improvement. We can draw
several conclusions from these results:

1. Neural Networks are indeed superior model-free discriminators. They
successfully utilized the leptonic information, whereas linear cuts were unable to
do so.

2. Genetic Algorithms are capable of maximizing over large-dimensional spaces.
For example, the dimension of the parameter space for the fourth set of data
is 625. It is truly astonishing that the population of neural networks was able
to undergo many rearrangements during its evolution. These rearrangements
originate from mutation and thus they are global in nature. The best solution
corresponds to the case of the largest parameter space. The performance
of traditional global algorithms generally degrades exponentially with the
dimension of the parameter space. This does not seem to be the case with
genetic algorithms. As a consequence, we can safely conclude that genetic
algorithms are a superior tool for complex maximization problems.

3. The results for the first two cases are similar to the results from linear cuts.
This indicates that linear cuts are sufficient in those cases.

4. The neural networks also discovered solutions with only one background
event. This is a further indication that one should be able to find solutions
with no background surviving at all. However, we should keep in mind that
we are dealing with pure samples at this stage. When additional particle
events are added to the sample, it may no longer be the case that all the
background can be removed.

In summary, by using neural networks trained with genetic algorithms in
conjunction with the statistical significance measure, we were able to remove virtually
all quark jets while retaining about 60% of the tau jets. The leptonic information
was necessary to achieve this. Linear cuts were not sufficient in this case.

As pointed out at the beginning of this chapter, these results are preliminary.
We only used quark jets as background and did not investigate gluon jets (although
they are very similar to quark jets and we do not expect any significant difference
in the results). Furthermore, we only used pure samples of jets. It is not clear
how the performance will change when other particles are added to the samples.
Nevertheless, we were able to illustrate that neural networks trained by genetic
algorithms give superior results for the data investigated so far. This methodology
can be applied directly to other samples of data.


CHAPTER  10 
CONCLUSION 


We demonstrated that neural networks and genetic algorithms can be applied
successfully in high energy physics. The application of neural networks to the
problem of signal enhancement gives superior results compared to traditional methods.
At the same time, neural networks can be used in conjunction with traditional
methods and, therefore, they do not replace the set of available tools for signal
enhancement but, instead, extend it.

Neural  networks  that  use  the  statistical  significance  measure  cannot  be  trained 
by  traditional  training  algorithms.  Genetic  algorithms,  however,  are  capable  of 
training  neural  networks  even  in  this  case.  From  that  point  of  view,  they  are  an 
extension  of  traditional  maximization  algorithms. 

In  addition,  the  genetic  algorithms  presented  in  this  dissertation  are  essentially 
parameterless  and  self-adaptive  to  both  the  problem  being  solved  and  the  stage  of 
evolution.  This  makes  genetic  algorithms  a viable  alternative  to  local  algorithms 
since  they  can  be  both  global  and  efficient.  In  addition,  their  capability  to  embed 
other  algorithms  makes  them  a true  generalization  of  training  algorithms — not  a 
replacement. 

Finally, both neural networks and genetic algorithms are universal tools. We
can always use neural networks as a universal model-free discriminant function
regardless of the nature of the input data. Our genetic algorithm, on the other
hand, can always be used as a universal maximization algorithm regardless of the
properties of the parameter space or the performance measure. For that reason, the
methods presented in this dissertation are readily applicable to other problems in
high energy physics. They are by no means limited to the four specific applications
presented here.


APPENDIX  A 
NOTATIONS 


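A.1 Gamma Matrices

In $D$ space-time dimensions, the gamma matrices $\gamma^0, \gamma^1, \gamma^2, \ldots, \gamma^{D-1}$ satisfy
the anticommutation relation

$$\{\gamma^\mu, \gamma^\nu\} = 2 g^{\mu\nu} I \qquad (A.1)$$

where

$$g^{\mu}{}_{\mu} = D. \qquad (A.2)$$

They have the following properties

$$\begin{aligned}
\mathrm{Tr}(I) &= D \\
\mathrm{Tr}(\gamma^\mu) &= 0 \\
\mathrm{Tr}(\gamma_5) &= 0 \\
\mathrm{Tr}(\text{odd}\ \gamma) &= 0 \\
\mathrm{Tr}(\gamma^\mu \gamma^\nu) &= D g^{\mu\nu} \\
\mathrm{Tr}(\gamma_5 \gamma^\mu) &= 0 \\
\mathrm{Tr}(\sigma^{\mu\nu}) &= 0 \\
\mathrm{Tr}(\gamma^\mu \gamma^\nu \gamma_5) &= 0 \\
\mathrm{Tr}(\gamma^\mu \gamma^\nu \gamma^\alpha \gamma^\beta) &= D\,(g^{\mu\nu} g^{\alpha\beta} + g^{\mu\beta} g^{\nu\alpha} - g^{\mu\alpha} g^{\nu\beta}) \\
\mathrm{Tr}(\gamma_5 \gamma^\mu \gamma^\nu \gamma^\alpha \gamma^\beta) &= -iD\,\epsilon^{\mu\nu\alpha\beta} \\
\gamma_\mu \gamma^\mu &= D I \\
\gamma_\mu \gamma^\alpha \gamma^\mu &= -(D-2)\,\gamma^\alpha \\
\gamma_\mu \gamma^\alpha \gamma^\beta \gamma^\mu &= (D-4)\,\gamma^\alpha \gamma^\beta + 4 g^{\alpha\beta}
\end{aligned}$$

where $I$ is the $D \times D$ unit matrix.

In four space-time dimensions, some of the more useful relations reduce to

$$\begin{aligned}
\gamma^\mu \gamma_\mu &= 4 \\
\gamma_\mu \gamma^\alpha \gamma^\mu &= -2\gamma^\alpha \\
\gamma_\mu \gamma^\alpha \gamma^\beta \gamma^\mu &= 4 g^{\alpha\beta} \\
\gamma_\mu \gamma^\alpha \gamma^\beta \gamma^\delta \gamma^\mu &= -2 \gamma^\delta \gamma^\beta \gamma^\alpha \\
\gamma_\mu \gamma^\alpha \gamma^\beta \gamma^\delta \gamma^\sigma \gamma^\mu &= 2\,(\gamma^\sigma \gamma^\alpha \gamma^\beta \gamma^\delta + \gamma^\delta \gamma^\beta \gamma^\alpha \gamma^\sigma)
\end{aligned}$$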
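The four-dimensional contraction and trace identities can be checked numerically. The sketch below uses the Dirac representation of the gamma matrices and the metric diag(1, -1, -1, -1). Both are assumed conventions chosen for the check, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def dirac_gammas():
    """gamma^0 .. gamma^3 in the Dirac representation (an assumed choice)."""
    sigmas = [
        [[0, 1], [1, 0]],
        [[0, -1j], [1j, 0]],
        [[1, 0], [0, -1]],
    ]
    g0 = [[0] * 4 for _ in range(4)]
    for i in range(4):
        g0[i][i] = 1 if i < 2 else -1
    gammas = [g0]
    for s in sigmas:
        g = [[0] * 4 for _ in range(4)]
        for i in range(2):
            for j in range(2):
                g[i][j + 2] = s[i][j]      # upper-right block: +sigma
                g[i + 2][j] = -s[i][j]     # lower-left block: -sigma
        gammas.append(g)
    return gammas


GAMMA = dirac_gammas()
METRIC = [1, -1, -1, -1]                   # diagonal of the metric tensor
IDENTITY = [[1 if i == j else 0 for j in range(4)] for i in range(4)]


def trace(a):
    return sum(a[i][i] for i in range(4))


def times(c, a):
    return [[c * v for v in row] for row in a]


def equal(a, b, tol=1e-12):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(4) for j in range(4))


def contract(inner):
    """Return gamma_mu * inner * gamma^mu, summed over mu."""
    total = [[0] * 4 for _ in range(4)]
    for mu in range(4):
        term = matmul(matmul(GAMMA[mu], inner), GAMMA[mu])
        for i in range(4):
            for j in range(4):
                total[i][j] += METRIC[mu] * term[i][j]
    return total


def g(mu, nu):
    return METRIC[mu] if mu == nu else 0


checks = [equal(contract(IDENTITY), times(4, IDENTITY))]
for a in range(4):
    checks.append(equal(contract(GAMMA[a]), times(-2, GAMMA[a])))
    for b in range(4):
        ab = matmul(GAMMA[a], GAMMA[b])
        checks.append(abs(trace(ab) - 4 * g(a, b)) < 1e-12)
        checks.append(equal(contract(ab), times(4 * g(a, b), IDENTITY)))
print(all(checks))
```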


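A.2 The Levi-Civita Tensor

The totally antisymmetric Levi-Civita tensor is defined as follows:

$$\epsilon^{\mu\nu\rho\sigma} =
\begin{cases}
+1 & \text{if } (\mu, \nu, \rho, \sigma) \text{ is an even permutation of } (0, 1, 2, 3) \\
-1 & \text{if } (\mu, \nu, \rho, \sigma) \text{ is an odd permutation of } (0, 1, 2, 3) \\
\ \ 0 & \text{otherwise}
\end{cases}$$

It satisfies the following identities:

$$\begin{aligned}
\epsilon^{\mu\nu\rho\sigma} \epsilon_{\mu'\nu'\rho'\sigma'} &= -\det\left(g^{\alpha}{}_{\alpha'}\right), \quad \alpha = \mu, \nu, \rho, \sigma; \ \ \alpha' = \mu', \nu', \rho', \sigma' \\
\epsilon^{\mu\nu\rho\sigma} \epsilon_{\mu\nu'\rho'\sigma'} &= -\det\left(g^{\alpha}{}_{\alpha'}\right), \quad \alpha = \nu, \rho, \sigma; \ \ \alpha' = \nu', \rho', \sigma' \\
\epsilon^{\mu\nu\rho\sigma} \epsilon_{\mu\nu\rho\sigma'} &= -6\, g^{\sigma}{}_{\sigma'} \\
\epsilon^{\alpha\beta\mu\nu} \epsilon_{\alpha\beta\mu\nu} &= -24.
\end{aligned}$$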
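The fully contracted and triply contracted identities can be verified by brute force. The sketch assumes that the definition above refers to components with all indices in one position, and that the indices are moved to the other position with the diagonal metric diag(1, -1, -1, -1). The helper names are hypothetical.

```python
from itertools import permutations, product


def parity(p):
    """+1 for an even permutation of (0, 1, 2, 3), -1 for an odd one."""
    inversions = sum(1 for i in range(4) for j in range(i + 1, 4) if p[i] > p[j])
    return -1 if inversions % 2 else 1


EPS = {p: parity(p) for p in permutations(range(4))}
METRIC = [1, -1, -1, -1]


def eps(idx):
    """Component of the tensor as defined in the text (0 for repeated indices)."""
    return EPS.get(tuple(idx), 0)


def eps_moved(idx):
    """Same component with all four indices moved by the diagonal metric."""
    sign = 1
    for i in idx:
        sign *= METRIC[i]
    return sign * eps(idx)


def full_contraction():
    return sum(eps(i) * eps_moved(i) for i in product(range(4), repeat=4))


def triple_contraction(s, t):
    return sum(eps((m, n, r, s)) * eps_moved((m, n, r, t))
               for m, n, r in product(range(4), repeat=3))


print(full_contraction())
print([[triple_contraction(s, t) for t in range(4)] for s in range(4)])
```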


APPENDIX  B 
DIRAC  SPINORS 


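A Dirac spinor, $\psi(x)$, describes a fermion. The Dirac equation is given by:

$$(i\gamma^\mu \partial_\mu - m)\, \psi(x) = 0 \qquad (B.1)$$

where $m$ is the mass of the fermion. The Dirac adjoint field, $\bar\psi(x)$, is defined as

$$\bar\psi(x) = \psi^\dagger(x)\, \gamma^0. \qquad (B.2)$$

If $A$ is a four-vector, we define the “slash” notation as

$$\slashed{A} = \gamma^\mu A_\mu. \qquad (B.3)$$

The following formulas are useful relations involving this notation

$$\begin{aligned}
\gamma_\mu \slashed{A} \gamma^\mu &= -2 \slashed{A} \\
\gamma_\mu \slashed{A} \slashed{B} \gamma^\mu &= 4 A \cdot B
\end{aligned}$$

where $A$, $B$, $C$, and $D$ are four-vectors. The Dirac equation can also be written as

$$(i\slashed{\partial} - m)\, \psi(x) = 0. \qquad (B.4)$$

If $u(p, s)$ and $\bar u(p, s)$ are the incoming and outgoing waves for the particle and
$\bar v(p, s)$ and $v(p, s)$ are the incoming and outgoing waves for the antiparticle, then:

$$\begin{aligned}
(\slashed{p} - m)\, u &= \bar u\, (\slashed{p} - m) = 0 \\
(\slashed{p} + m)\, v &= \bar v\, (\slashed{p} + m) = 0
\end{aligned}$$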


u {p,  = 1/2) 

u{p,Sz  = -1/2) 
u (p,s^  = 1/2) 

v{p,  Sz  = -1/2) 


E-\-m 


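$$\begin{aligned}
u(p, s)\, \bar u(p, s) &= (\slashed{p} + m)\, \tfrac{1}{2} (1 + \gamma_5 \slashed{s}) \\
v(p, s)\, \bar v(p, s) &= (\slashed{p} - m)\, \tfrac{1}{2} (1 + \gamma_5 \slashed{s}) \\
\sum_s u(p, s)\, \bar u(p, s) &= \slashed{p} + m \\
\sum_s v(p, s)\, \bar v(p, s) &= \slashed{p} - m \\
\bar u(p, s)\, u(p, s') &= 2m\, \delta_{ss'} \\
\bar v(p, s)\, v(p, s') &= -2m\, \delta_{ss'} \\
u^\dagger(p, s)\, u(p, s') &= 2E\, \delta_{ss'}
\end{aligned}$$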




BIOGRAPHICAL  SKETCH 


Youli Kanev was born in Sofia, Bulgaria, in July 1964. In 1983, he
graduated  with  honors  from  the  German  Language  Gymnasium  in  Sofia.  The  same 
year,  he  won  the  Youth’s  National  Prize  for  Achievement  in  the  field  of  electronics. 
After  serving  for  a year  in  the  army,  he  enrolled  at  the  Institute  for  Electrical 
Engineering  in  Sofia. 

At the beginning of 1985, he transferred to the University of Sofia and
enrolled  in  the  program  of  theoretical  physics.  In  1986,  he  began  specializing  in 
the  field  of  nuclear  physics  and  elementary  particle  physics.  During  his  studies, 
he  was  also  conducting  research  at  the  Institute  for  Nuclear  Research  and  Nuclear 
Energy  at  the  Bulgarian  Academy  of  Sciences. 

In addition to his work in the area of physics, he also took part in numerous
projects at the Institute for Technical Cybernetics and Robotics for the
development of computer systems for industrial process control. He was involved in these
projects as hardware designer and system programmer.

In  1989,  he  defended  his  diploma  thesis  in  the  field  of  hydrodynamics  and 
in  October  1989  he  graduated  from  the  University  of  Sofia  with  a diploma  in 
theoretical  physics.  Immediately  after  graduation,  he  took  a research  position  at 
the  Department  of  Physics  at  the  University  of  Sofia. 




In  January  1990,  Youli  Kanev  was  granted  a Fulbright  Scholarship  and  enrolled 
in  the  graduate  program  in  physics  at  the  University  of  Maine.  In  1991,  he  defended 
his  master’s  thesis  in  the  area  of  relativistic  quantum  physics. 

In  August  1991,  he  enrolled  in  the  Ph.D.  program  in  physics  at  the  University  of 
Florida  and  in  1995,  he  enrolled  in  a concurrent  M.S.  program  at  the  Department 
of  Computer  and  Information  Sciences  and  Engineering.  In  1997,  he  received  an 
M.S.  degree  in  computer  and  information  sciences. 

He  is  currently  preparing  his  Ph.D.  dissertation  in  the  area  of  high  energy 
physics  in  which  he  is  applying  artificial  intelligence  for  the  detection  of  elementary 
particles  at  hadron  colliders. 


I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and 
quality,  as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 


R.  Field,  Chairman 
Professor  of  Physics 

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.


J. Ipser
Professor of Physics

I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and 
quality,  as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 



P.  Sikivie 

Professor  of  Physics 

I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and 
quality,  as  a dissertation  for  the  degree  of  Doctor  of  Philosophy. 


R.  Woodard 
Associate  Professor  of  Physics 

I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.


J. Keesling
Professor of Mathematics


This dissertation was submitted to the Graduate Faculty of the Department of
Physics in the College of Liberal Arts and Sciences and to the Graduate School
and was accepted as partial fulfillment of the requirements for the degree of Doctor
of Philosophy.

August  1998 


Dean,  Graduate  School 


